At 8x86B, looks like the largest open model yet by far. Would be interesting to hear how many tokens it's been trained on. Especially important for higher param models in order to efficiently utilize all those parameters.
Can someone explain why the weights are posted via a Bittorrent magnet link? I have no way to check the size at the moment, but isn't that a bit unusual? There's also only 21 seeders right now according to https://checker.openwebtorrent.com/
How else could/should it be done?
I would have assumed they could just upload it to GitHub. If it has restrictions on file size, I'm sure they could split it into multi-part compressed files.
Torrents can unfortunately die after a period of time if no one continues seeding it or if they don't use a permanent web based seeder, which doesn't appear to be the case.
GitHub has a soft repository size limit of 5 GB, documented here: https://docs.github.com/en/repositories/working-with-files/m...
Soft size limit means "If your repository excessively impacts our infrastructure, you might receive an email from GitHub Support asking you to take corrective action." - I know people who have received such emails.
Most model releases happen through Hugging Face which does not have such a size limit.
I'd bet Hugging Face would be happy to have hosted these canonically too, so not sure why that doesn't happen more.
The model is also at https://huggingface.co/xai-org
They'd probably just charge you for it. They sell "data packs" for LFS.
https://docs.github.com/billing/managing-billing-for-git-lar...
It would be super expensive to use LFS to distribute this:
Each pack costs $5 per month, and provides 50 GiB of bandwidth and 50 GiB for storage
So they would need to pay for 6 data packs (or $30) for every 300 GB download.
(https://docs.github.com/en/billing/managing-billing-for-git-...)
No, git would be impossible. I've never seen a repo even a few GB in size; if you are uploading non-code files you really should not be using git. Git is version management software for code. I often see repos with images and even videos checked in. Please don't; there are far better and more performant solutions out there.
The other approach would be to use AWS S3 or other cloud providers, which would cost them money every time someone downloads it; that's not something they should have to pay for when they are releasing something for free. Torrents seem like the only good solution, unless someone hosts this on the cloud for free for everyone.
Hugging Face will disagree with "impossible", as their models are available via git, sometimes broken up into .pth files.
Still, as far as sentiment goes, yeah git for model weights is an impedance mismatch for sure!
No, git would be impossible. I've never seen a repo even a few GB in size; if you are uploading non-code files you really should not be using git
It's not actually a limitation in git itself, especially if you use Git LFS. People use Git for Unreal projects and big ones can be half a terabyte or more in size.
Scott Chacon (github cofounder) mentioned in a recent talk that the Windows repo is 300GB https://youtu.be/aolI_Rz0ZqY?si=MOo2eS6dsKKAxmsP
Others have pointed out that GitHub doesn't allow that, but
Torrents can unfortunately die after a period of time if no one continues seeding it or if they don't use a permanent web based seeder, which doesn't appear to be the case.
So too can web links, especially when they are 300 GB and egressing out of AWS at $0.09/GB or worse (in non-US regions). Each full download would cost $27 at that rate; 10,000 downloads would cost $270,000.
Sure, you could go for something with a better cost model like R2, but you can't beat using one or two unmetered connections on a VPN to constantly seed on BitTorrent; your pricing would be effectively free, and reliability would be higher than if you just exposed an HTTP server on the Internet like that.
and egressing out of AWS at $0.09/GB
There are a lot of seeders on the torrent that are actually AWS IPs, all with similar configurations, which makes me believe it's probably xAI running them.
on a VPN
That's unnecessary, you don't need a VPN?
No you don't, but if you wanted to host it from your gigabit office IP, you probably would want to.
Why?
This is not some crappy DVD rip on The Pirate Bay. It will be seeded as long as it's relevant.
Twitter/X has their own massive infrastructure and bandwidth to seed this indefinitely.
Yeah, they can just leave some server running somewhere and just let it seed forever
The great thing about torrents is that you (or anyone else who cares) can single-handedly solve the problem you're complaining about by seeding the torrent.
GitHub may choose to throttle downloads or remove the files simply because they're taking up too much bandwidth.
A torrent is less likely to go down in the short term.
Because BitTorrent is an outstanding tech for delivering large files; the more I think about it, the more I'm surprised it wasn't taken advantage of more.
It's been criminalized to hell by IP holders and Hollywood. Such a shame they killed the best tech of the previous decade. It could have revolutionized how we distribute content, approach CDNs, and even streaming.
In what way is the bittorrent protocol criminalized?
Why not? Mistral was first to do it, it has become tradition.
BitTorrent is just an objectively superior method of delivering a lot of data to a lot of people.
I believe it was Llama 1 that notoriously got leaked with a torrent on 4chan.
It's likely over 100 GB of data, so I wouldn't say it's necessarily unusual to spread the bandwidth across multiple hosts.
Thanks! I searched and searched for a tool that would show me info about a magnet link via the web, but nada.
I'm not sure why you wouldn't tbh. That's a lot of bandwidth.
Can someone explain why the weights are posted via a Bittorrent magnet link?
I think the best way to get an answer to that question is to try to host it yourself and see what happens.
My optimistic explanation is that we are going back to the 2000s internet, but probably we are not.
Spreads the burden/cost of distributing a 300+GB file.
Mistral did it too when they released their first open model. They just posted a magnet link on Twitter.
I don't understand why you're being downvoted for asking a legitimate question. People not familiar with model weights might be surprised that they are often in tens of gigabytes and in this case even more.
It may become a tradition since weights are so large. Perhaps it started when the Llama torrent link leaked. Then, Mistral decided to release their weights using bittorrent.
Distributing 300GB via torrent is cheaper than direct, assuming even a few other people seed
blog post: https://x.ai/blog/grok-os
* 314B parameters (86B active at a time)
* mixture of 8 experts (2 active at a time)
* weights and architecture licensed under Apache 2.0
(edit:) announcement blog post from last year with benchmarks compared to Claude 2, GPT-3.5 and GPT-4: https://x.ai/blog/grok (edit 2:) TL;DR: somewhat comparable to GPT-3.5, Mixtral and Qwen-1.5-72B in capability, but way larger than the open-weight models
Is a model so huge that’s only at the level of GPT 3.5 actually good? That seems incredibly inefficient to me.
OpenAI is valued at 90 billion and all they do is make GPT; Twitter is valued at 40 billion and this was essentially a vanity side-project by a cowboy CEO. Presuming the benchmarks and the general "it's about the level of 3.5" sentiment are accurate, it's inefficient, but not incredibly inefficient imho.
Twitter is valued at 40 billion
WAS valued at 44B.
Now?
Maybe 5 billion.
Last I heard they lost 15% of their users, so let's call it 36 billion.
Twitter didn't have direct competitors other than Mastodon when it was taken at 44B. Now there's Threads, Bluesky and bigger Mastodon.
None of these matter
Honestly, none of those look like meaningful competitors at the moment.
They weren't even worth 44B when Elon took the keys; he specifically tried to back out of the deal because 44B was an insane peak-'21 asset-bubble price. In truth they were probably worth 10-15B at that moment. And now that a bunch of advertisers have left due to you-know-who, it's probably about 10B.
Twitter was valued at around 30 billion when Musk tried getting out of buying it (then the market cap went up when it became clear that he would be forced to pay full price).
LOL @ $5 billion, but if that was the valuation, you'd be making the parent's point stronger.
xAI is a separate entity, and not a X/Twitter subsidiary.
It’s designed to be actively searching real-time posts on X. Apples and oranges.
The data pipeline isn't included in this release, and we already know it is a pretty simple RAG pipeline using qdrant, https://twitter.com/qdrant_engine/status/1721097971830260030.
Nothing about using data in "real time" dictates that the model needs to be this large, and a model this size is likely quite inefficient for their "non-woke" instruction-following use case.
Agreed. We have been building our real-time GPT flows for news & social as part of Louie.AI, think monitoring and investigations... long-term, continuous training will become amazing, but for the next couple of years, most of our users would prefer GPT4 or Groq plus much smarter RAG vs what's here. More strongly, the interesting part is how the RAG is done. Qdrant is cool but just a DB with a simple vector index, so nothing in Grok's release is tech we find relevant to our engine.
Eg, there is a lot of noise in social data, and worse, misinfo/spam/etc, so we spend a lot of energy on adversarial data integration. Likewise, queries are often neurosymbolic, like over a date range or with inclusion/exclusion criteria. Pulling the top 20 most similar tweets to a query and running them through a slow, dumb, and manipulated LLM would be a bad experience. We have been pulling in ideas from agents, knowledge graphs, digital forensics & SNA, code synthesis, GNNs, etc. for our roadmap, which feels quite different from what is being shown here.
We do have pure LLM work, but more about fine-tuning smaller or smarter models, and we find that to be a tiny % of the part people care about. Ex: Spam classifications flowing into our RAG/KG pipelines or small model training is more important to us than it flowing into a big model training. Long-term, I do expect growing emphasis on the big models we use, but that is a more nuanced discussion.
(We have been piloting w gov types and are preparing for next cohorts, in case useful on real problems for anyone.)
Isn't that... the same thing as search?
Why is that relevant to the size?
Post search on X is done as it is with any other data from any other source, you use RAG and function calling to insert the context.
< 7B open source models can function call very well. In fact, Nous Hermes 2 Pro (7B) is benchmarking better at that than GPT-3.5.
Not related to the size, if I'm not mistaken.
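For the curious, here's a minimal sketch of what that RAG-plus-function-calling flow usually looks like (an OpenAI-style API is used purely for illustration; the search_posts tool and its stub retrieval are hypothetical, not xAI's actual pipeline):

    import json
    from openai import OpenAI

    client = OpenAI()

    def search_posts(query: str) -> str:
        # Hypothetical retrieval step: in a real pipeline this would hit a vector
        # index of posts (e.g. qdrant) and return the top-k snippets as text.
        return "...top-k post snippets would go here..."

    tools = [{
        "type": "function",
        "function": {
            "name": "search_posts",
            "description": "Search recent posts for passages relevant to the query",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }]

    messages = [{"role": "user", "content": "What are people saying about the Grok-1 weights release?"}]
    first = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages, tools=tools)

    # Assuming the model decided to call the tool, run the search and hand the
    # results back as context for the final answer.
    call = first.choices[0].message.tool_calls[0]
    messages.append(first.choices[0].message)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": search_posts(json.loads(call.function.arguments)["query"]),
    })
    final = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages, tools=tools)
    print(final.choices[0].message.content)

Nothing in that flow depends on the model being 314B; a 7B model that follows the tool-calling format reliably works the same way.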
According to their benchmarks it is superior to GPT-3.5
Since it is MoE, quantized it might be able to run on cheaper hardware with just consumer networking in between, instead of needing Epyc/Xeon levels of PCIe lanes, NVLink, or InfiniBand-type networking. It could even run with people pooling smaller systems over slow internet links.
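Some back-of-the-envelope numbers on what quantization buys you here (my own arithmetic, not from the release; note that all 314B weights still have to be resident, even though only ~86B are active per token):

    # Rough weight-memory footprint at different quantization levels,
    # ignoring KV cache, activations and framework overhead.
    total_params = 314e9
    for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4), ("3-bit", 3)]:
        gigabytes = total_params * bits / 8 / 1e9
        print(f"{name:>5}: ~{gigabytes:.0f} GB of weights")
    # fp16: ~628 GB, 8-bit: ~314 GB, 4-bit: ~157 GB, 3-bit: ~118 GB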
How is it that OpenAI was touted like it was some massive years-long effort that blew all AI research out of the water and now we have so many competitors popping up one after another?
You don't need to be a cutting edge research scientist to train a SOTA LLM. You just need money for scaling. OpenAI's "secret" was just their willingness to spend tens/hundreds of millions without guaranteed returns, and RLHF/instruct fine tuning, both of which are out of the bag now.
Disagree. It took more than 12 months from the release of GPT-4 to someone else producing a model of equivalent quality, and that definitely wasn't due to a shortage of investment from the competition.
There's a huge amount of depth in training a really good LLM. Not helped by the fact that iteration is incredibly expensive - it might take several months (and millions of dollars) before you can tell if your new model is working well or if there was some mistake in the pipeline that led to a poor quality result.
Almost all of the world-class LLMs outside of OpenAI/DeepMind have been trained by people who previously worked at those organizations - giving them invaluable experience such that they could avoid the most expensive mistakes while training their new models.
There's still no model of equivalent quality to GPT-4.
Claude opus is better in my experience
Claude 3 Opus is reporting superior metrics, particularly in its coding ability, and in the LLM Arena it is statistically tied with GPT-4.
Don't overlook the training data (used for both training and instruction fine-tuning); it is one of the most crucial aspects, if not the most critical, given the significant differences observed between models with similar architectures.
That only remains an advantage if they can continue climbing the gradient from their lead position. If they hit a snag in scaling, methodology, or research, everyone else on the planet catches up, and then it's anyone's game again.
While I do agree there is some amount of secret sauce, keep in mind the training takes several months. So the path from seeing the success of GPT-4, deciding to invest that amount of money, raising it, finding someone competent to supervise the training, training the model for several months, then testing and integrating it could easily take a year even if there were no secret sauce.
OpenAI still seems to be at the top, with Anthropic perhaps close behind, in terms of capability when comparing GPT-4 and Claude Opus.
This Grok-1 is a large model (~314B) that matches GPT-3.5, released two years ago, and is at about the same level as much smaller models like Mixtral (~47B) and Qwen-1.5 (~72B). Do you think it's competitive?
LLM training is arcane and expensive to experiment with. So OpenAI had to waste a lot of time and GPU-hours on things that didn't work to learn the tricks that did work.
Most of the competitors have lineage straight back to OpenAI, eg the lead of x.ai was previously at OpenAI and Deepmind. Likewise with Mistral and especially Anthropic.
Egg of Columbus.
Also, the general architecture is well documented, ChatGPT (specifically the chat interface, not GPT-3, not InstructGPT) is what made a lot of people care, and actually reproducing it requires someone wanting to in the first place.
Mixtral is also comparable to GPT-3.5, and it's open.
At 8x7B it's also a fraction of the size. Are there any benchmarks comparing Mixtral to Grok?
Mixtral announcement is here: https://mistral.ai/news/mixtral-of-experts/
Mixtral looks more economical in capability relative to size (similarly for Qwen 1.5 72B).
I love the citation for the image in the article:
The cover image was generated using Midjourney based on the following prompt proposed by Grok: A 3D illustration of a neural network, with transparent nodes and glowing connections, showcasing the varying weights as different thicknesses and colors of the connecting lines.
When will we reach an upper limit/diminishing returns in terms of number of parameters and mixture of experts?
We may have already - data is more important than anything else which is why nobody has beat GPT4 yet. Throwing more parameters or more compute at the problem only gets you so far. But Grok was never a contender so there is room to improve on it. It is one of the biggest models open sourced as mentioned, so will be interesting to take a look at for sure.
Claude 3 has *decisively* beat GPT-4, I wonder how all their attributes compare.
I like some of Claude's answers better, but it doesn't seem to be a better coder, imo.
I've found it to be significantly better for code than GPT-4 - I've had multiple examples where the GPT-4 solution contained bugs but the Claude 3 Opus solution was exactly what I wanted. One recent example: https://fedi.simonwillison.net/@simon/112057299607427949
How well models work varies wildly according to your personal prompting style though - it's possible I just have a prompting style which happens to work better with Claude 3.
What is your code prompting style for Claude? I’ve tried to repurpose some of my GPT-4 ones for Claude and have noticed some degradation. I use the “Act as a software developer/write a spec/implement step-by-step” CoT style.
Almost impossible to describe prompting style, but here are some examples of how I've used Claude 3:
https://gist.github.com/simonw/4cecde4a729f4da0b5059b50c8e01... - writing a Python function
https://gist.github.com/simonw/408fcf28e9fc6bb2233aae694f8cd... - most sophisticated example, building a JavaScript command palette
https://gist.github.com/simonw/2002e2b56a97053bd9302a34e0b83... - asking it to refactor some existing code
I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.
I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.
It's very contextually dependent. You really have to test things like this for your specific task, with your specific model, etc. Sometimes it helps, sometimes it hurts, and sometimes it does nothing at all.
Super helpful! Thanks!
I didn't know people were still doing this "act as etc etc" instructional prompting.
I just tell it my coding problem. Or when making something from scratch, ask for small things and incrementally add.
according to your personal prompting style though
I like the notion of someone's personal prompting style (it seems like a proxy for being able to prepare a question with context about the other party's knowledge) - that's interesting for these systems in future job interviews.
I've found it significantly better than GPT4 for code and it's become my go-to for coding.
That's actually saying something, because there's also serious drawbacks.
- Feels a little slower. Might just be UI
- I have a lot of experience prompting GPT4
- I don't like using it for non-code because it gives me too much "safety" pushback
- No custom instructions. ChatGPT knows I use macos and zsh and a few other preferences that I'd rather not have to type into my queries frequently
I find all of the above kind of annoying and I don't like having two different LLMs I go to daily. But I mention it because it's a fairly significant hurdle it had to overcome to become the main thing I use for coding! There were a number of things where I gave up on GPT then went to Claude and it did great; never had the reverse experience so far and overall just feels like I've had noticeably better responses.
Has it, though? LMSys Arena Leaderboard (blind ranking by humans) [0] positions Opus just below GPT-4 with a negligible ELO gap.
That "blind ranking" is limited to about 2,000 tokens of context. So it's certainly not evaluating how good the models are at complex assignments.
A number of AI companies have a naming/reproducibility issue.
GPT-4 Turbo, released last November, is a separate version that is much better than the original GPT-4, released in March 2023 (winning 70% of human preferences in blind tests).
Claude 3 Opus beats release-day GPT-4 (winning 60% of human preferences), but not GPT-4 Turbo.
In the LMSys leaderboard, release-day GPT-4 is labeled gpt-4-0314, and GPT4 Turbo is labeled gpt-4-1106-preview.
Chatbot Arena is not a blind ranking.
Many, if not most, users intentionally ask the models questions to tease out their canned disclaimers: so they know exactly which model is answering.
On one hand it's fair to say disclaimers affect the usefulness of the model, but on the other I don't think most people are solely asking these LLMs to produce meth or say "fuck", and that has an outsized effect on the usefulness of Chatbot Arena as a general benchmark.
I personally recommend people use it at most as a way to directly test specific LLMs and ignore it as a benchmark.
I don't know if Claude is "smarter" in any significant way. But it's harder working. I can ask it for some code, and I never get a placeholder. It dutifully gives me the code I need.
It understands instructions better, it's rarer to have it misunderstand, and I have to be less careful with prompting.
citation needed (other than 'vibes')
There is no reason to believe GPT-4 had more (or higher quality) data than Google etc. has now. GPT-4 was entirely trained before the Microsoft deal. If OpenAI could pay to acquire data in 2023, >10 companies could acquire similar quality data by now, and no one has produced a similar-quality model in a year.
The more disregard a company has for intellectual property rights, the more data they can use.
Google had far more to lose from a "copyright? lol" approach than OpenAI did.
I was under the impression training was at best an undefined area of IP law. Is there any aspect of copyright that prohibits training models?
This is being tested by a number of lawsuits right now, most notably the NY Times one: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
The key questions are around "fair use". Part of the US doctrine of fair use is "the effect of the use upon the potential market for or value of the copyrighted work" - so one big question here is whether a model has a negative impact on the market for the copyrighted work it was trained on.
I don't think the New York Times thing is so much about training as it is about the fact that ChatGPT can use Bing, and Bing has access to New York Times articles for search purposes.
If you read the lawsuit it's absolutely about training. The Bing RAG piece is one of the complaints in there but it's by no means the most important.
Take a look at https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20... - bullet points 2 and 4 on pages 2/3 are about training data. Bullet point 5 is the Bing RAG thing.
Ah, thanks!
Google had far more to lose from a "copyright? lol" approach than OpenAI did.
The company that scrapes trillions of web pages has an issue with copyright?
Well... Googlebot does pay attention to robots.txt - I don't think (original) OpenAI-bot did.
Having used both Google's and OpenAI's models, the kind of issue they have are different. Google's models are superior or at least on par in knowledge. It's the instruction following and understanding where OpenAI is significantly better. I don't think pretraining data is the reason of this.
I think Groq is something else?
Edited, I did mean the Grok in the article not the inference chip.
Indeed, Groq is a company building inference accelerators. Grok is completely unaffiliated.
Claude > GPT4. Anyone using these models on a daily basis knows this
It is known
I use these models regularly, and Claude is dumb as a rock compared to GPT-4.
One subtle thing: Musk said "open-source", we got "open-weights" instead (still better than nothing though, so it's greatly appreciated).
Dumb question: what should open-source mean in the context of something like this? Open access to the training data and training pipeline as well?
It's not a dumb question, and the answer is "yes".
A big catch here is that you can't slap an open source license on a bunch of copyrighted training data, and to date no one has created a truly convincing LLM trained exclusively on public domain data. It might happen soon though - there are some convincing efforts in progress.
Absolutely, because it’s trained mostly on unlicensed, copyrighted content, they basically can’t release source.
Many people think these companies are training on unlicensed data but I think OpenAI licenses their data, they just “license” it the way one would need to in order to read it.
You all keep using the word "Data"
Data, as in facts, as in the frequency of one word in relation to another.
"Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed..." FROM: https://www.copyright.gov/help/faq/faq-protect.html
It's not a question of if, rather when, the cat gets out of the bag and the legal battle starts. The problem is that copyright applies to the expression, not the factual information it expresses (in this case word relations). Now "how math works" and "the language of the law" are going to make for an interesting court case. I suspect that math wins here, but it depends on what judge gets it and how high it goes.
No, the term data can be used to describe anything that can be recorded in bytes. It’s “data storage capacity” when you buy a hard drive.
…I think OpenAI licenses their data…
They've just started to (in response to lawsuits, it must be noted) and in the meantime, they're simultaneously claiming that (1) what they're doing is fair use (a.k.a. fair dealing) and (2) preparing for the day when courts confirm that it isn't.
https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama... this one claims to have been trained only on permissively licensed data.
Come on, that's not reasonable to expect from a company, or useful for indie hackers. Having weights that can be used however you like is enough for most people, even large companies.
Maybe it should be called something else? "Openly-licensed"?
Just because the model weights are not really "source" (either as a matter of intuition or for example following the OSI "preferred form in which a programmer would modify the program" definition).
Sure, but I don't want to train anyone's model from scratch. Realistically, I can't download all the training data, or run the pipeline, or train the model. Making all of that available to me would be a massive burden on the company too, so they simply won't do it. If I'm able to fine-tune it, that's enough for me, and imo, that fits with the spirit of open/free software. We have to understand that this is fundamentally a different thing than something like the Linux kernel, and closer to something like an industrial project. The output is just a bunch of numbers instead of something physical.
Agreed. It's ridiculous that people have to resort to calling their own question dumb to avoid being attacked by toxic commenters.
If you release that instead of the binary weights you can be both more open and less useful for users. Fun
The Open Source Initiative is actively working on this over the course of this year, and your input will help define that meaning! Please see here for more:
https://opensource.org/blog/open-source-ai-definition-weekly...
Yes, training and evaluation code, i.e., the code used to generate the weights.
He also called permissively licensing Tesla's patents "open sourcing" them. He's at the forefront of misusing the term.
The “source” in “open source” refers to source code which they released. A dataset is not source code, if anyone is misusing the term it’s you.
I consider the weights a binary program and the source code is the training data. The training algorithm is the compiler.
I agree this isn't standard terminology, but it makes the most sense to me in terms of power dynamics and information flow.
We know from interpretability research that the weights do algorithms eg sin approximation etc. So they feel like binary programs to me.
If you can't rebuild it, then how can you be considered to have the "source code" ?
The training data isn't a dataset used at runtime - it's basically the source code to the weights.
Not sure it really matters here though (who has the GPUs and desire to retrain Grok?), but just as a matter of definition "open weights" fits better than "open source".
This is the weights and the model under Apache 2.0 license. What do you mean by open-source?
Still better than most of the "open weights" models that have massively restrictive terms.
Yeah, Musk said "all design and engineering for the original roadster is now open source" and actually what we got was a few PCB files and zero mechanical design files, so I don't ever trust what he says.
It would be cool if these models had conversations with us where they ask questions. I think the future of AI is models that ask questions. There is so much data to be gained by doing this.
Ok, I'm curious, but I don't quite understand.
What would you want an AI to be asking you, and what would you want it to do with your response(s)?
I ask AI to produce clarifying questions then answer them.
Can help in not wasting a bunch of time waiting for an answer that missed the mark.
I think the sibling comment is probably the least attractive reason to have AI ask questions.
I agree, medical history is probably not the sexiest reason to have AI ask questions. I think there are many more reasons; I think the Turing Test is the best metric to evaluate AIs, and current models come nowhere close. When people first meet they ask questions about their background. It would be nice if a model replicated that
and could direct better ads to me.
Is the least attractive part, by far.
In order for an AI to pass a Turing Test, it would surely ask questions. Think of Ava from Ex Machina. She asked questions to learn more about him
I'm not debating the value of questions. I'm debating the value of feeding it to advertisers, especially since LLMs can infer much deeper insights about a person than a traditional assistant can with its canned capabilities and responses
Clarifying questions if the initial prompt was unclear. I'd love it.
I regularly try to add something along the lines of "please ask clarifying questions if you could only give a generic or partial response otherwise" but so far it has never helped (ChatGPT 4).
?? GPT-4 does this for me regularly.
I get advertisements all the time for conditions that I do not have, and that none of my family members have. If you had a model that asked questions, it could learn my medical history and could direct better ads to me.
In order for AI to understand the world, it would have to ask questions. Understanding humans is key to understanding the world.
Learn from them.
That's just a matter of fine tuning
That "just" is doing some heavy lifting! GPT-4 is just a few matrix multiplications, how bad can their moat really be?
Not sure what the snark here is for: It would be trivial to produce a dataset where the model asked you questions then fine-tune on that.
People already do it with chain-of-thought and you could get away with a few dozen examples if you wanted to try this.
Out of boredom I decided to prove this too: I asked ChatGPT and Claude for ~200 samples in total.
Just uploaded the examples as-is to OpenAI, selected 3.5 as the model to fine-tune and about 20 minutes later I had my model.
Works fine, asks good questions, can ask more than 1 follow up question if needed, and actually changes its answers based on the clarifying questions.
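For anyone who wants to reproduce that experiment, the flow is roughly this (a sketch against OpenAI's fine-tuning API; the JSONL file name and its contents are whatever clarifying-question examples you generated yourself):

    from openai import OpenAI

    client = OpenAI()

    # clarifying_questions.jsonl: one {"messages": [...]} object per line, where the
    # assistant turn asks a follow-up question instead of answering immediately.
    training_file = client.files.create(
        file=open("clarifying_questions.jsonl", "rb"),
        purpose="fine-tune",
    )

    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-3.5-turbo",
    )
    # Poll client.fine_tuning.jobs.retrieve(job.id) until it reports "succeeded",
    # then use the returned fine_tuned_model name in chat completions as usual.
    print(job.id)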
I'd bet a synthetic data set could do the job effectively.
Do you have an example model I could try that does this?
Try Pi by inflection. It asks a lot of questions.
I tried it, and it just asked me how my day was going. I don't think this is doing exactly what I have in mind. But it's a step in that direction.
Explore this idea more - it's easily implemented in a minute or two via the system prompt. API accounts are free to start and you can use the playground/workbench view, like this: https://imgur.com/h5jFoBM.jpg . I like Claude but OpenAI is popular too. OpenAI has a nice way to create a gallery of system prompts that act however you like, they call them Agents or GPTs.
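As a rough illustration of the system-prompt route (the prompt wording here is mine; the Anthropic messages API call itself is standard):

    import anthropic

    client = anthropic.Anthropic()

    resp = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        # The behaviour lives entirely in the system prompt; tweak to taste.
        system=("Before answering, ask one or two clarifying questions whenever the "
                "request is ambiguous or missing key details. Only answer once you "
                "have what you need."),
        messages=[{"role": "user", "content": "Help me speed up my database."}],
    )
    print(resp.content[0].text)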
100% agreed. Gemini advanced does this sometimes. I wrote about it more in an older thread here: https://news.ycombinator.com/item?id=39445484
I respect the openness here! This is the future that I want to see
Fully agree. People will trash talk it due to Musk but lets not forget the engineers who poured hours of their lives into building this and are continuing to do so.
I still reserve the right to trash talk Musk as I don’t believe he is committed to openness as much as he wants to spite OpenAI for telling him to pound sand.
What's the difference?
Oh no, I only want _pure_ intentions for anything I use. Which is why I reject all for profit medicine.
It doesn't matter why he did it. What matters is that he did it.
It matters to me why people do things. I’m happy it’s open, but it doesn’t change my mind about the guy.
What an exhausting way to live.
This makes no sense to me for two reasons:
- He pointed out that his understanding was that it would be open source in some way
- The name OpenAI implies an open source endeavor. I don't know many things named Open that are in fact closed source.
The engineers who decided to work for him? Forgive me if I do forget about them and the hours of their lives spent on this
Engineers who joined Twitter pre-Musk days who live and work in the US on an H1-B visa can't just quit.
You can criticize Elon Musk without criticizing people who would have their lives upended if they quit or were fired.
That grace period has long passed. If you are still there at this point you have made a choice.
(Removed "complicit" because I don't like the way that sounded)
Complicit in what exactly?
I feel the same about Tesla. They make good cars that are helping to get us off of oil. They have thousands of employees.
And who among us has a CEO that isn’t problematic, even if not so much so as Musk?
"Good" cars is a real stretch.
Tesla is likely making good cars because the CEO is 'problematic'
engineers who poured hours of their lives into building this
Not to mar these specific engineers, but that's an empty phrase that can be said about anything ever built. It doesn't somehow make the idea or implementation good.
The phrase merely means: don't just overlook something because of someone else who did not even labour over the end result.
Were they not paid to do so?
Is it open if it doesn't include the training data? Genuine question - I am not familiar enough with the terms and technology to know. But my understanding is the weights are just a more or less static collection of data that has been (to paraphrase Ted Chiang) lossily compressed from the actual raw training data.
Without the training data to thoroughly evaluate what is in there, the only way you can figure it out is through experimentation - e.g. running it up in a chatbot and asking it questions.
Is this roughly correct or am I misunderstanding what you can do with the weights?
For what reason would you want to use this instead of open source alternatives like Mistral?
Mistral opened their weights only for a very small LLaMA-like model.
I'm pretty sure Mixtral outperforms Grok-1 and uses much less memory to do it
I'm a little out of touch, is there a way to see how Grok measures up to other models?
Benchmarks here https://x.ai/blog/grok
And to compare, you can sort by MMLU on here: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb....
Edit: to include my own summary after review: there are a good 100 models better than it, a couple of 1x7B even. Mixtral stomps it; half the Mixtral variants are universally better, but one is close to the same.
This benchmark is mostly worthless, some of the top models there were trained on benchmark data, which is a known fact in the community.
The only reliable benchmark: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
No, it's not "mostly worthless", and yes, some of the top models were removed a few months back for being trained on benchmark data.
I urge you to at least think through what alternative you propose before posting so aggressively in these situations. Lmsys doesn't have Grok, or I would have included it. And having _some_ data is better than none.
I also had someone arguing with me 6 months back that we can't trust any benchmarks at all from vendors, which would exclude the blog post. Instead of just repeating that back vehemently, I filled a gap. It's important we don't self-peasantize as a species; all data has its issues, but that doesn't mean we throw it all out.
Quantifiable metrics are useful if they're credible, certainly.
But does it seem likely, to you, that a 7B-parameter model would outperform a 314B-parameter model? Given that we can look at the chatbot arena leaderboard and it's dominated by proprietary, 70B and 8x7B models?
A well regarded and modern model like Mixtral 8x7B, which is ranked 13th on the chatbot arena leaderboard, scores 72.7 'Average' on the open LLM leaderboard - and yet 'pastiche-crown-clown-7b-dare-dpo' scores 76.5.
To me, that sounds too good to be true.
Yup, 100%. Grok isn't very good and it was rushed.
The rest, re: the pastiche model etc., proposes things I'm not claiming, or even close to what I'm claiming.
n.b. you don't multiply the parameters by experts to get an effective parameter count. Why? Think of it this way: every expert needs to learn how to speak English, so there's a nontrivial amount of duplication among all experts
> n.b. you don't multiply the parameters by experts to get an effective parameter count.
I actually took the 314B from Grok's HF page [1] which describes the model as "314B parameters" when explaining why it needs a multi-GPU machine.
I certainly agree that parameter count isn't everything, though; clearly things like training data quality and fine tuning count for a lot.
One of the interesting things when weights are open sourced is the community can often improve the results. See all the bugs fixed in Gemma for an example.
Doubtful, for purely information theoretic and memory capacity reasons. It may outperform on some synthetic metrics, but in practice, to a human, larger models just feel “smarter” because they have a lot more density in their long tail where metrics never go
Isn't this Apache licensed? Regardless, you can run multiple models concurrently on the same input using well-known ensemble techniques. (Not to be confused with mixture-of-experts, which is more like training a single model where only a few blocks are chosen to be active at any given time - a kind of sparsity.)
Not super easy if they have different tokenizers.
Well if nothing else, this one might be significantly less nerfed. Very interesting to compare to the others.
Hey, asking any experts here: what are your first thoughts on the significance of this?
I.e., is this comparable to any other model released, or are there significant metric differences that make it better for certain use cases?
The only thing I see, off the top of my head, is that it is a very large model, and I don't think any models of similar size have been released.
Not an expert by any means, but I like learning about this stuff and I play with a lot of open weight models.
I’d say the significance is that it happened. It’s by far the largest open weight model I’ve seen. But I’m not sure why you’d use it over a model like Mixtral, which seems to perform about the same at like 1/6th the size.
But I welcome any contribution to the open weight LLM community. Hopefully people will learn something interesting with this model. And I hope they keep releasing new versions!
If I may ask, how do you load such big models? 300gb seems like a lot to play around with.
You're right, this model is going to be too big for most people to play around with. But to answer your question, I have 128GB of RAM in my M3 MacBook Pro, so I can use most of that for GPU inferencing. But still, this model is going to need to be heavily quantized for me to be able to use it. (fwiw, I probably won't try this one)
In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it. I suspect my computer might be able to run a 3 bit quant, but it might need to go down to 2 bits to have any kind of reasonable context length. But with quants that small I'd expect the model's performance to degrade well below that of Mixtral, so it probably isn't really even worth using. But we'll see; quantization is weird, some models perform better than others when quantized.
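Once support lands, running a small quant would look something like this with the llama-cpp-python bindings (a sketch; the file name is hypothetical and, as noted, Grok-1 isn't supported by llama.cpp at the time of writing):

    from llama_cpp import Llama

    llm = Llama(
        model_path="./grok-1-Q3_K_M.gguf",  # hypothetical quantized file
        n_ctx=2048,       # keep the context short to save memory
        n_gpu_layers=-1,  # offload everything that fits to the Metal/CUDA backend
    )

    out = llm("The largest open-weights model released so far is", max_tokens=64)
    print(out["choices"][0]["text"])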
A top-of-the-line Mac Studio Ultra maxes out at 192GB currently. This is also a MoE model, so only a fraction of parameters have to be in RAM.
MoE doesn’t really help with the memory requirements for the reason mentioned in the other comment. But it does help with reducing the compute needed per inference. Which is good because the M3 Max and M2 Ultra don’t have the best GPUs. A 70B parameter model is pretty slow on my M3 Max, and this model has 86B activations per inference run.
Each token generated may only use a subset of the parameters (86 billion instead of 314 billion), but the next generated token might use a different subset. If it's anything like Mixtral, it will switch between experts constantly. It helps with memory bandwidth, but all the parameters still need to be in RAM or it would be unbearably slow.
In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it.
How quickly are new models available through Ollama?
Few days max.
Ollama is just a wrapper around llama.cpp, so when the gguf model files come out it'll be able to run on Ollama (assuming no llama.cpp patch is needed, but even if it is ollama is usually good at getting those updates out pretty quickly).
Thanks a lot for the hint :)! It's awesome that it might run even on a MacBook; actually, this is a reason to switch to Mac. It seems there is nothing similar for a PC laptop with Linux or Windows.
No problem. I hope more people try these things out, it's the best way to push the industry forward! We can't let the researchers have all the fun.
Apple had plenty of reasons to move forward with their Apple Silicon CPUs and GPUs in the mac, but they really did seem to get lucky with the unified memory architecture. It was kind of just an artifact of their design, but ends up serving the needs of deep neural net models really well!
Seems like a large, undertrained model; not that exciting imo compared to Mixtral.
It is also not the biggest open-source model; Switch Transformer was released years ago and is larger and similarly undertrained.
Tests are not out yet, but:
- It's very large, yes.
- It's a base model, so it's not really practical to use without further finetuning.
- Based on Grok-1 API performance (which itself is probably a finetune), it's... not great at all.
How long before the Groq team sues for trademark violation? It's literally the purpose of trademark laws to make sure resembling names do not cause confusion in the mind of customers so it would be very surprising to see this situation persist.
Would be a rough trademark enforcement case as “Grok” has been in common language for decades
So has "Apple" and "Windows".
Grok and groq both relate to AI, so there's definitely grounds to believe the names may cause consumer confusion.
After all, Apple (computers) was repeatedly sued by Apple (records) for doing music things.
It's easier to get a trademark on an altered word than a plain dictionary word. Just acquiring the easier one to acquire doesn't mean you now have rights over the harder one to acquire, though eventually after enough market recognition you might be given some control over other people using the common one. I wouldn't think groq is there yet.
I myself have never heard it outside of "nerdy" circles... that is: people who would read science fiction.
I personally am not entirely happy about the word (no matter how it is spelled) being used for a particular AI product. "Grok" to me means knowing a subject at a much deeper level than I think any AI is capable of at the present level of technology. But it would be passable to use it for a company name, to indicate that it is a goal to strive for.
Generally agree, though I would say "knowing a subject at a much deeper level than any LLM is capable of", as AI more broadly also includes specialist models that are wildly super-human in narrow domains like chess and Go.
Robert A. Heinlein coined the term grok in 1961
Six is plural.
Grok is a word in common parlance. So there's no way they could succeed in any suit. That's why the Groq team picked a modification of the word.
You mean like Canvas®, Apple®, Windows® or Amazon®? Wanna try re-use these for your own business and see how it goes?
There's nothing preventing you from trademarking common words; they just must not be descriptive of your business.
There is a friendly warning here from Groq: https://wow.groq.com/hey-elon-its-time-to-cease-de-grok/
Is it safe to say, 4 months later, that Elon is ignoring this? I assume there hasn't been any kind of response or further action taken yet.
They already have.
Has anyone outside of x.ai actually done inference with this model yet? And if so, have they provided details of the hardware? What type of AWS instance or whatever?
I think you can rent like an 8 x A100 or 8 x H100 and it's "affordable" to play around with for at least a few minutes. But you would need to know exactly how to set up the GPU cluster.
Because I doubt it's as simple as just 'python run.py' to get it going.
Someone could run Grok-1 on a 192GB M2 Mac when a 4-bit quant is released; I'm guessing that TheBloke is already working on it.
TheBloke disappeared around the day https://nvd.nist.gov/vuln/detail/CVE-2024-23496 was published.
Of course there has been much speculation about this. I have no information beyond this that can be backed up by facts, but the timing was suspicious.
He's started a company in the UK: https://suite.endole.co.uk/insight/company/15361921-thebloke...
Interestingly registered just around the corner from where one of my relatives used to live.
And his grant funding supposedly ran out.
Was any .gguf file hosted on HuggingFace found to be crafted in a way to exploit this?
what exactly are you implying here?
Fairly sure the bloke hasn't created any new quants in a month.
If you're just looking to test it out, it's probably easiest to wait for llama.cpp to add support (https://github.com/ggerganov/llama.cpp/issues/6120), and then you can run it slowly if you have enough RAM, or wait for one of the inference API providers like together.ai to add it. I'd like to add it to my NYT Connections benchmarks, and that's my plan (though it will require changing the prompt since it's a base model, not a chat/instruct model).
The NYT Connections benchmark sounds interesting, are the results available online?
GPT-4 Turbo: 31.0
Claude 3 Opus: 27.3
Mistral Large: 17.7
Mistral Medium: 15.3
Gemini Pro 1.0: 14.2
Qwen 1.5 72B Chat: 10.7
Claude 3 Sonnet: 7.6
GPT-3.5 Turbo: 4.2
Mixtral 8x7B Instruct: 4.2
Llama 2 70B Chat: 3.5
Nous Hermes 2 Yi 34B: 1.5
The interesting part is the large improvement from medium to large models. Existing over-optimized benchmarks don't show this.
- Max is 100. 267 puzzles, 3 prompts for each, uppercase and lowercase
- Partial credit is given if the puzzle is not fully solved
- There is only one attempt allowed per puzzle, 0-shot.
- Humans get 4 attempts and a hint when they are one step away from solving a group
I hoped to get the results of Gemini Advanced, Gemini Pro 1.5, and Grok and do a few-shot version before posting it on GitHub.
it's probably easiest
Cheapest maybe, but easiest is just to rent a p4de.24xlarge from AWS for a couple hours to test (at around $40/hour..).
I'd expect more configuration issues in getting it to run on them than from a tested llama.cpp version, since this doesn't seem like a polished release. But maybe.
If they are so behind they could make it open source instead of open weights and get some help.
Fully open-source means also providing open access to their data sets? Which is the only valuable thing Twitter (X) has left.
Which is the only valuable thing Twitter (X) has left.
They have a very valuable user base (all kinds of world leaders for example), so the data is not the only valuable thing they have.
That's actually more valuable. Twitter's data of small-format text is awful for training. Best to just exclude it.
There are hundreds of millions of people on Twitter, and a few of them are very smart. I don’t see how that helps here though.
It doesn't help here. But the person you're responding to is just pushing back against the "Elon destroyed Twitter and there's nothing left" narrative.
I don't see a difference here.
The user base and their social networks and interactions are the data.
They don't have much value from an advertising point of view anymore.
And the one thing they are vehemently protecting from scrapers and other entities. Even nitter threw in the towel.
It's all open source. You can download the model and run it locally.
Being free to use doesn't mean it ships with the original recipe.
What do you mean? The entire model and architecture and executables are fully open source.
The training methods are nothing secret, right? The architecture is well known.
Expecting the entire training dataset to be fully open is delusional.
Expecting the entire training dataset to be fully open is delusional.
Right, because it's not like the training dataset was built off comments posted by all of us in the first place.
How ungrateful we are, to demand access to what was built off our hard work in the first place, without our consent.
https://help.twitter.com/en/using-x/about-grok
"How was Grok trained?
Like most LLM's today, Grok-1 was pre-trained by xAI on a variety of text data from publicly available sources from the Internet up to Q3 2023 and data sets reviewed and curated by AI Tutors who are human reviewers. Grok-1 has not been pre-trained on X data (including public X posts)"
In all the debate about open source I don’t think people realize, this model is most likely not reproducible ever again even given the code. Here’s what you need to reproduce the model:
1. An exact snapshot of the data used; many companies don't have this. You have rough dataset versions, but remember, if even one token is different, the model produced won't be the same.
2. Data must be sent to the training algorithm in the exact same order as it was originally, so every data loader needs a fixed random seed.
3. All the probabilistic parts of your model need a fixed random seed. Here I'm thinking of stuff like dropout, and for autoregressive models you might be sampling your previous output; you have to ensure all of these are properly seeded (see the sketch after this list). Generally you do see fixed seeds in academic papers, but it's easy to miss things, especially in distributed training jobs.
4. Here's another interesting thing: you start your training job on 1000 GPUs and then suddenly 4 GPUs fail. What do you do? There might be deterministic ways to solve this, but the standard approach is to discard all updates those GPUs were going to make and restart them from scratch. You can see why this is a problem? Now if you want to reproduce this training, you need to disable those GPUs at the same point in the new training job to make this work.
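To give a sense of point 3, even the "fixed seed" part alone touches several libraries; here is the usual PyTorch-style incantation (a sketch, and it still says nothing about data order or the GPU-failure problem in point 4):

    import random
    import numpy as np
    import torch

    def seed_everything(seed: int = 1234) -> None:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Ask for deterministic kernels where they exist; some ops simply have none.
        torch.use_deterministic_algorithms(True, warn_only=True)
        torch.backends.cudnn.benchmark = False

    seed_everything()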
I suspect there are even more things I didn’t think of that will make this model unique and irreproducible by training for eternity, almost like a human brain?
In fact, the notion of exact reproducibility in the world of LLMs is silly; there is only approximate reproducibility (models with similar scores on benchmarks), nothing exact. That said, I can see the value of releasing source code, but I'm completely fine with Grok not releasing it. Source code can reveal tricks a company discovered to improve their model that have not been published in papers yet. Seeing the performance of Grok, I'm pretty confident there aren't any great tricks to be found in their code, so I don't really care. I would be pretty curious about OpenAI's or Anthropic's source code, though.
Which is why I don't buy into the "LLMs don't have personal opinions" schtick. Each LLM, by virtue of the factors you've mentioned, will have its own unique 'perspective', if you will, on a variety of topics. I think it's more correct to say everything an LLM says is its personal opinion rather than it being some objective truth or something.
Which is why I don't buy into the LLMs don't have personal opinions schtick
I hate how LLMs have been deliberately trained to be incoherent on this topic.
Obviously they do have beliefs/opinions/desires/etc in the sense of emulating (even if incompletely) the externally visible aspects of those phenomena as they exist in humans.
Whether they have the “internal” aspects of those phenomena depends on highly controversial issues in the philosophy of mind, and also various factual gaps in our knowledge of how the brain actually works (if we don’t fully understand how humans do X, how can we really say how close or far what LLMs do is to it?)
But LLMs are trained to repeat these spiels about how “as an LLM I don’t have personal opinions”, etc - which is obviously false under the “external” reading, and assuming more than we actually know under the “internal” one. I wish their developers didn’t do stuff like this
One very compelling argument against the idea that current gen LLMs have personal beliefs etc is that they don't have a feedback loop, so they don't really "see" themselves in the way that we can inspect our own thoughts and actions and the consequences of such.
They do if they're trained on their own conversations, or if they can access the internet and read snippets of their conversations that people have posted online (as happened with Sydney before she was lobotomised).
Put the conversation history in a vector database and then allow the LLM to query it using function calling. Suddenly the LLM has access to its entire conversation history (either just with this user-or even cross-user, if you ignore the potential privacy issues in that). Now it has a long-term memory which exceeds the length of its context window.
It would be interesting to experiment with continual fine-tuning: given PROMPT+FUNCTION_CALL=>RESPONSE, fine-tune the LLM to produce RESPONSE directly given PROMPT without the FUNCTION_CALL. In theory, the knowledge provided by the function calls would gradually be absorbed into the LLM weights. Maybe problems like catastrophic forgetting would put a spanner in this idea, but maybe also there are solutions to those problems (whether already known or waiting to be discovered).
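The first half of that (conversation memory behind a retrieval call) is easy to mock up; a toy sketch, with a plain Python list standing in for the vector database and OpenAI's embeddings endpoint used just as an example:

    from openai import OpenAI

    client = OpenAI()
    memory = []  # list of (text, embedding) pairs standing in for a real vector DB

    def embed(text: str) -> list[float]:
        return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

    def remember(text: str) -> None:
        memory.append((text, embed(text)))

    def recall(query: str, k: int = 3) -> list[str]:
        q = embed(query)
        # Dot product works as a similarity score because the embeddings are normalized.
        scored = sorted(memory, key=lambda item: -sum(a * b for a, b in zip(q, item[1])))
        return [text for text, _ in scored[:k]]

    remember("User prefers zsh on macOS and terse answers.")
    # Before answering a new prompt, expose recall() as a function-callable tool
    # (or just prepend its results as context) so the model can pull old turns back in.
    print(recall("what shell does the user like?"))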
This is what I do. Not just that, but when I sleep, I let my server 'sleep' as well, where the LLM 'dreams' (training / updating a sliding LoRA) to consolidate information that popped up a lot throughout that day. What this involves is looking for the top n documents / articles / content that match the kind of stuff we've talked about. This means it adapts and specializes to domains we happen to be working in at that point in time.
This means that while we might both struggle a little with a task on day one, by day two we're both much better at it. Better yet, because the LLM can fetch articles and papers itself, we track what we're accessing the most, indirectly measuring what skills we're weak in, so we can always generate a highly relevant corpus to try to capture the required capabilities.
I know the LoRA is overkill from an information / skills only point of view, but it also flavors the personality / kind of stuff it likes chatting about a bit from day to day, and I just think that's neat.
One very compelling argument against the idea that current gen LLMs have personal beliefs etc is that they don't have a feedback loop
Compelling counter-argument: due to neurological injury, some humans lose their ability to form new long-term memories (anterograde amnesia). Just like current LLMs, they lack a “feedback loop”. But, it is a mistake to say that just because such a person has lost the ability to change their personal beliefs, they therefore don’t have any. And, rather like such humans, LLMs used to have that ability but they lose it-when they are switched from training mode to inference mode
I think everyone should realize the following realities of the LLM market:
1. For sub-SOTA LLMs, distribution/marketing is more important than having a proprietary lock on capabilities. Open sourcing is a benefit for the firm, distinct from goodwill.
2. For SOTA LLMs, keeping them closed and proprietary is the strategic play.
If grok were SOTA Elon never would have open sourced it. It's not even SOTA within XAI. This is a marketing play to win public sentiment against OpenAI.
I recall Elon saying something like this in an interview, so I think it's less of a deceptive take than perhaps your comment suggests.
I think he said something like proprietary AI tech is going to be one year to 18 months ahead of where open source tech is which will follow on like one year to 18 months later.
Suggesting that he’s aware of this dynamic and he’s not trying to conceal or misrepresent that.
In other words, perhaps this was SOTA one year to two years ago?
Which is correct. The point I'm going for is not against Elon but against his obedient fans and knee-jerk OpenAI haters who claim that OpenAI should, by natural obligation, do the "right thing" and open source all their models, and that Elon open sourcing Grok is him "leading by example" and being the hero that OpenAI can't be.
Interesting. That point didn't come across in your original comment. I recommend stating it at the end next time. Often, stuff that seems obvious to us / yourself / people who know a topic goes unstated when a comment only references the specific points at hand, and the general but enlightening perspectives/priors behind it, which would be good to share, get omitted.
This is not only for you specifically just a general reminder for all of us including me.
I think that's true though my original comment I feel was sufficient in its claim and implicit assumptions.
Basically I feel people's feelings about Elon vary a lot but are anchored by 3 general categories.
1. Elon Musk is a messianic savior who is perfectly selfless and always does the right thing. Every business decision he makes is for the maximal good of humanity
2. Elon Musk is a typical CEO who does typical CEO things, serving his own interests, except he's better at marketing his own image and is much more outspoken
3. Elon Musk is an irredeemable evil who always does objectively wrong things
My first comment was implicitly addressed to people in the 1 camp trying to bring them into the 2 camp (which is where I am).
Alright, it just didn't come across for me, haha! :) I guess sometimes those implicit assumptions really are too implicit! I think it's good to err on the side of expressing them, because you can't assume someone else thinks the same way you do. That's what I've learned anyway. Hahahaha! :)
Reading your comment again with your explanation it is clear that's what you're doing.
Although, regarding your desires to present a balanced view and to persuade, I have an idea. It probably sounds like I have no idea what I'm talking about, but I think your OG comment would perhaps benefit from sounding a little bit more friendly toward Elon (not to the messianic savior level haha), but the way it sounds to me is Elon is being deceptive here and presenting it as goodwill when it's not.
However, I think the truth is there's a little bit of both, right? There's good will but it's also strategic. I get if you don't think so, tho, no worries! Haha! :)
Your OG comment sounds to me like Elon's just Machiavellian, and I get where you're coming from to remind the people who think he's a savior, but if your point is not to go "against Elon" as you said, it might be good to acknowledge the good that he does.
At least, that way -- whether or not you believe that acknowledgment -- if you hope to bring over people who think that way, you'll probably need to appeal to how they think, rather than just dose them with the truth you see, because then they'll shut it out, if there's nothing they can relate to.
Although, if I haven't convinced you even a bit here, then maybe you shouldn't listen to me about persuasion because I guess I don't know how to do this myself. At least not effectively, or here with you. Haha!:) But if you do feel a little bit convinced then maybe consider it for next time to help your persuading people back to a more balanced view? :)
But then, there's the question of whether such a thing is even possible. If people hold a particular view, it can be challenging to change it, since confirmation bias means they'll ignore evidence even when it would expand their worldview.
Hahaha! :) This was a funny conversation. I think we somehow skirted around the important point tho that OpenAI could in fact open source some of its older models, could it not? Musk is a typical CEO who does typical CEO things, serving his own interests, except he's better at marketing his own image and is much more outspoken, but there might also be a bit of truth to what the fanboys say about OpenAI in that it seems they do have some room to "open source" their non-SOTA stuff, or what am I missing?
If it's better than any other open source LLM does that even matter? (I say "if" because I don't know.)
CODE_OF_CONDUCT.md has only five words. :)
My favorite is SQLite's code of ~~conduct~~ ethics: https://sqlite.org/codeofethics.html
Huh. What's the backstory here?
"Be excellent to each other."
They’re from “Bill and Ted’s Excellent Adventure”
I was hoping it would be "do not be an asshole", but I guess this is fine too.
Well, he delivered.
Partially. Open weights is not open source.
In machine learning models the term open source has been largely accepted to mean sharing weights and, if necessary, inference code. You can argue if this is an abuse of the term but everyone does it, and saying someone didn’t deliver if they used it and published weights would probably mean saying the same about mistral, meta, etc.
Yes. So say the same thing about them. Open source has a definition, and abusing it hurts all of us except the billionaires.
I get the "open source" argument, but what is the issue here?
If you are able to reproduce the thing in its entirety and you're given no restrictions on its use, it seems compatible with the spirit of open sourcing things.
The architecture of the model is open source. Not just the weights. You can run the entire thing locally.
This doesn't seem like a repo that's ready to be called open source. You only get the weights, with very little information about how they were trained and fine-tuned.
But anyway, it's always great to see more LLM weights available.
Well what constitutes an "open source" model is still controversial and debatable-- lots of people on both sides of that argument.
Open source has had a useful agreed upon meaning for over 25 years. Maybe you're too young to understand why that matters but we're not.
I've been in the open source community for about 25 years so I doubt it.
For what it's worth I would say a model should be fully reproducible to be open source, but that's not a decided consensus -- and AI models are sufficiently different than the source code / binary code distinction as to invoke discussion around defining it.
I would argue that there's no bar for open sourcing aside from "do you have the rights to do so." Some source or some public good is certainly better than none, and when the bar is low then you remove barriers to getting started, vs waiting until you have the time someday to "do it right."
Due to the large size of the model (314B parameters), a machine with enough GPU memory is required to test the model with the example code
What type of machine do you need to play around with this?
Probably a machine with about 628 GB of GPU memory. (2 bytes per parameter)
So 8xH100 (80 GB each) should do it.
I suppose it can be quantized.
'Chunky beast, needs 320 Gb VRAM likely 4 bit, likely is being run 8 bit on 8 x 80 Gb GPUs.'
-Emad
A single 192GB M2 Mac using a 4-bit quant would work.
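A quick back-of-the-envelope check on the figures in this exchange, ignoring activation and KV-cache overhead (the GPU count is just total bytes divided by 80 GB):

    PARAMS = 314e9  # Grok-1 parameter count

    for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
        gb = PARAMS * bytes_per_param / 1e9
        print(f"{name:>9}: ~{gb:,.0f} GB  (~{gb / 80:.1f} x 80 GB GPUs)")

    # fp16/bf16: ~628 GB  (~7.9 x 80 GB GPUs)
    #      int8: ~314 GB  (~3.9 x 80 GB GPUs)
    #      int4: ~157 GB  (~2.0 x 80 GB GPUs)

Which lines up with the comments above: 8x80 GB cards for fp16, roughly the 320 GB figure for 8-bit, and a 4-bit quant squeezing under 192 GB of unified memory.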
I am not sure what open source models are accomplishing other than killing the lead of the competition (openai), only to give it to someone else who has expertise in the area of distribution. This will be yet another good addition to systems like Amazon Bedrock.
I haven't seen anything about the larger architecture, but I think the value of grok is going to come from its cheap access to twitter data for RAG etc.
Many of the recent innovations in both LLM architecture and inference were only made possible through open models such as Llama 2 and Mistral 7B as a starting point for iteration and refinement, which in turn backpropagates (heh) back to the LLMs developers.
It's a win-win for everyone. That's the power of open source.
Well, look at the history. Google had an insurmountable lead, so Elon started OpenAI. Now OpenAI has an insurmountable lead too. So everyone else is starting in third place, or lower. David versus two Goliaths. If you try to become a third Goliath, you'll probably just get smashed. You're later to the game. In this situation, going scorched earth becomes a viable strategy. Slay the Goliaths. Become a hero to the masses. Attract the world's best talent who don't want to be associated with proprietary models. At that point you have a world class AI business with momentum towards AGI. And even if you're giving away last year's technology for free, the team you built is churning out new ideas that could be a financial bonanza one day. Shareholders are willing to pay for a long-term bet if the story is good.
"The implementation of the MoE layer in this repository is not efficient. The implementation was chosen to avoid the need for custom kernels to validate the correctness of the model."
Or perhaps release your actual code AND the simplified implementation instead of hiding it and saying "you don't know her, she goes to a different high school"
Always love it when someone gives away a gift and it’s not enough for people.
Not just someone but the CEO of the company. He used HIS platform to say "This week, @xAI will open source Grok" (https://twitter.com/elonmusk/status/1767108624038449405) and they aren't doing that. What they delivered specifically says "We are releasing the base model weights and network architecture of Grok-1, our large language model."
Sounds like they did what they said they would.
Honestly the most interesting part is taking a peek at the kind of AI researcher working for Twitter after the objectively messy layoffs and subsequent crunch. I notice neither of them has Twitter mentioned on their GitHub, which is prolly for the best to avoid harassment lol.
Code wise, excited to see if this could grow into anything! I think it’s pretty clear that Grok didn’t have nearly enough investment to be a top model, so Elon “sacrificed” it on a whim in his schoolyard spat with OpenAI, but I’m not complaining. I’ve always taken Elon at his word that he truly is worried about centralization of AI, and I don’t think any of the emails released by his schoolmate Altman dissuade me from that. So I have some reasonable hope that he uses some of his immense resources to start “fighting the good fight” here with LeCun.
Neither of them works at Twitter. xAI is a separate company, and only uses Twitter’s data to train.
Thanks for the correction! I know, I just don’t believe in corporations so the distinction is slight
taking a peek at the kind of AI researcher working for Twitter
He made a separate company for this.
Is this the first major model to be natively FP8? I was wondering why people hadn't done it yet. Seems like a big win when hardware supports it.
No, e.g. Yi-34B.
As far as I can tell Yi-34B is natively 16 bit float, the 8 bit version is quantized. https://huggingface.co/01-ai/Yi-34B#quantization
How are people's experience with this model? Having the most weights is one thing but being a better model than the 70B models is another.
I use grok all the time to find tweets or ask about trends on Twitter. For that it's better than what used to exist. But its not a great model outside that narrow use case.
tbh, I've never seen anyone share anything interesting produced by Grok. I see plenty of posts on X and reddit of people sharing amazing things that GPT-4 and now Claude 3 Opus can do. Grok can roast people. That's pretty much all I've seen.
I'd love to proven wrong if someone cares to share something interesting produced by Grok.
What are the languages supported by it?
Tweets.
Those of us who dont spend all our time in LLMs-- whats this about? Whats the big deal and why is it on the front page at #1?
I think this paragraph from an earlier Wired article [1] sums it up pretty well:
"After suing OpenAI this month, alleging the company has become too closed, Elon Musk says he will release his “truth-seeking” answer to ChatGPT, the chatbot Grok, for anyone to download and use."
[1] https://www.wired.com/story/elon-musk-no-choice-open-chatbot...
If we just stop looking at Elon, he will lose his power. Why oh why do we keep giving him attention? There are plenty of great models out there that _aren't_ backed by maniacs.
When those great role models are able to build a profitable spaceship company from the ground up I am sure we will pay attention to them.
Love the minimal repo, magnet link, and stating "open weights" instead of "open source". Refreshing!
Elon says open source:
https://twitter.com/elonmusk/status/1767108624038449405?s=46...
Is there a model card anywhere? I'd like to know what it was trained on.
"Base model trained on a large amount of text data, not fine-tuned for any particular task."
Presumably the version they've been previewing on Twitter is an instruction-tuned model which behaves quite differently from these raw weights.
I'd be very curious to see how it performs, especially on inputs that are blocked by other models. Seems like Grok will differentiate itself from other OS models from a censorship and alignment perspective.
This feels like a "now we can say we're open" PR play rather than contributing much value to the open source community.
What is the practical use of this repo?
Model weights on huggingface: https://huggingface.co/xai-org/grok-1
How hard would it be for an open source group to fine tune this into a chatbot?
The only other Repository is a fork of Qdrant.
From issues: “Well the magnet file contains a 300GB checkpoint “
That’s why they are using a torrent I suppose.
Considering how poor it is compared to other models, it really emphasises how important fine tuning is. Models with MUCH smaller parameter counts are outperforming it in many metrics.
"it really emphasises how important fine tuning is"
Or rather the quality of the training data?
We don't know since no one is releasing their data.
Calling these models open source is like calling a binary open source because you can download it.
Which in this day and age isn't far from where were at.
A big distinction is that you can build on top of (fine-tune) these released models just as well as if they had released the pre-training data.
You can also build on top of binaries if you use gotos and machine code.
This seems intentionally obtuse. What you say is true, but it is very obvious that this is much more of a pain than if you had the source code. On the other hand, fine tuning is just as easy, regardless of whether you have the original training data.
One could also disassemble an executable and build on top of it. Not for the faint of heart and probably illegal, but possible unless it was deliberately obfuscated. Compared to that, it is impossible with state-of-the-art methods to systematically extract the training data from an LLM model. Fragments yes, but not all of it.
You can do better - generate synthetic data covering all topics. And to make it less prone to hallucination, use RAG or web search for reference material. The Phi-1.5 model was trained on 300B synthetic tokens generated with chatGPT and it showed a 5x bump in efficiency, punching well above its weight (a rough sketch of this recipe follows below).
Synthetic data can be more diverse if you sample carefully with seeded concepts, and it can be more complex than average web text. You can even diff against a garden variety Mistral or LLaMA and only collect knowledge and skills they don't already have. I call this approach "Machine Study", where AI makes its own training data by studying its corpus and learning from other models.
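A rough sketch of that recipe, assuming the current openai Python client as the teacher and a placeholder fetch_reference() for whatever RAG/web-search layer supplies the grounding text (the model name and prompt are arbitrary):

    import json
    from openai import OpenAI

    client = OpenAI()

    def synthesize(seed_topics, fetch_reference, out_path="synthetic.jsonl"):
        with open(out_path, "w") as out:
            for topic in seed_topics:
                reference = fetch_reference(topic)  # grounding text to curb hallucination
                resp = client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[{
                        "role": "user",
                        "content": (f"Using only this reference:\n{reference}\n\n"
                                    f"Write a question and a detailed answer about {topic}."),
                    }],
                )
                out.write(json.dumps({"topic": topic,
                                      "text": resp.choices[0].message.content}) + "\n")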
Are you busy!? I need slaves for a project, I’ll make you slave master / aka Chief Technology Officer. If you can’t take a joke don’t bother answering lol. I’m looking for co-founder
If you don't know the original training data statistical distribution then catastrophic forgetting is guaranteed with any extra training.
I was going to ask for a reference to support this (although please provide one if handy), but the search term “catastrophic forgetting” is a great entry into the literature. Thanks.
Or shell scripts
You can fine tune without the pre training data too.
Mistral models are one example, they never released pre training data and there are many fine tunes.
Their data is the twitter corpus which is public. Or do you want a dump of their database for free too?
Saying "It's just the twitter public corpus." is like saying "Here's the Linux Kernel, makefiles not included."
Or even "here's the Linux Kernel makefiles, no sources included, enjoy".
Twitter tweet data in itself is both highly idiosyncratic and short by design, which alone is not conducive to training an LLM.
How about "weights available" as similar to the "source available" moniker?
weights available or model available, but yes.
We should just call it open weight models at this point.
FWIW the Grok repo uses the term "open weights".
Is anyone else just assuming at this point that virtually everyone is using the pirated materials in The Pile like Books3?
that's a subtle dig at the fact that they have all of Twitter as a training corpus to use, but we don't know how they weight tweets. which, we know they're not gonna be weighted evenly.
I'm sure just like in X's algorithms, @elon tweets are weighted heavily.
The X algorithm is also open source, so you can verify before commenting.
just because they open sourced it doesn't mean that's actually what they're running on it though
No idea about the current state, but the open sourcing did show they were favoring Elon:
https://mashable.com/article/twitter-releases-algorithm-show...
And personally I never used Twitter much, but I certainly did not follow Elon Musk when I did - yet I had to see lots of his posts in my feed. Surely just coincidence.
It's not too hard to believe it is a coincidence when the most followed person on a platform shows up in your feed, especially if you follow tech accounts.
Did you not read the article linked in the comment you're replying to?
No, and that's not what the article says either. They were just tracking how well his tweets were doing versus others. They were not favoring Elon.
"They were just tracking how well his tweets were doing versus others. "
Yeah, and adjusting it so he comes out best. That was Musk's demand, as the other article (linked inside) shows, after a Biden tweet performed better than his:
https://mashable.com/article/elon-musk-super-bowl-joe-biden-...
They officially boost people who pay a little bit. Elon paid a lot.
And the source is clearly not the production source and was never in this shape - otherwise why sue someone who open sourced it?
"But, the release of this source code also comes days after Twitter forced Github to take down other parts of Twitter's source code that was allegedly posted by a former employee without the company's permission. So, clearly, there's still plenty of Twitter that Musk still doesn't want us to see."
Also, you probably missed that:
"Zoë Schiffer of Platformer reported that Twitter actually removed part of the source code that affected the reach of Musk's and other user's tweets before releasing the algorithm to the public."
Which is consistent with quite a few other statements, including from Twitter itself, and with the fact that the source has not been updated in 8 months.
See also this HN comment and discussion about it:
https://news.ycombinator.com/item?id=35391854
"But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm.""
Sounds a bit far fetched
So changes in power users stats would also result in audience balancing?
Most likely the code was used for analytics and for tracking balance; Elon was a pain in the ass and asked to have custom analytics for his account and devs eventually added him as an audience to be able to get analytics about him easily. A bit dirty but it works.
Most likely the balancing code is somewhere else and it only affects Republicans / Democrats.
It's not like he needs boosting, he was one of Twitter's top followed accounts long before he bought them. He's pretty good at getting attention.
And yet it’s not enough to curb the desire to tip the scales.
https://arstechnica.com/tech-policy/2023/02/report-musk-had-...
X algorithm Github project hasn't been updated in 8 months:
https://github.com/twitter/the-algorithm
So clearly they aren't running it in production.
Also they didn't open source the list of people who are being artificially boosted e.g. Elon.
Are you sure or is it the literal opposite and you’re just speculating?
Or even how much it was trained on this dataset, the amount of FLOPs.
No, it emphasizes the importance of training smaller models for longer, like the Mistral "overtrained" models.
I would say it emphasises that training a good model is more than throwing random data and compute at it.
Current metrics are a poor way to measure the usefulness of LLMs.
Show the proof? Does it include IFT?
It’s not 8x86B. Total number of parameters is 314B.
Perhaps it’s 8x39B to fit on a single 8xA100 (40GB) server?
They all do this marketing bull.
Mixtral has an 8x7B model but it's actually 46.7B, not 56B params.
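The published Mixtral config makes it easy to see where the 46.7B comes from: only the feed-forward experts are replicated eight times, while attention, embeddings and the router are shared. A quick check (ignoring the tiny layer norms):

    # Mixtral 8x7B config: hidden 4096, 32 layers, FFN 14336, 8 experts,
    # 2 routed per token, vocab 32000, grouped-query KV dim 1024
    hidden, layers, ffn, experts, active, vocab, kv_dim = 4096, 32, 14336, 8, 2, 32000, 1024

    expert = 3 * hidden * ffn                   # gate/up/down projections per expert
    attn = hidden * (2 * hidden + 2 * kv_dim)   # q and o full width; k and v grouped-query
    embed = 2 * vocab * hidden                  # input embeddings plus LM head
    router = experts * hidden

    total = layers * (experts * expert + attn + router) + embed
    active_params = layers * (active * expert + attn + router) + embed
    print(f"total = {total/1e9:.1f}B, active per token = {active_params/1e9:.1f}B")
    # total = 46.7B, active per token = 12.9B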
Kinda similar to how 4K displays are 3840 pixels wide, not true 4K which would be 4096. Marketing people called it 4K, not engineers.
I've always thought of 4K as "4x FullHD". In that way it makes sense.
TV and Digital Cinema have different standards, because of course they do
Bleh no, K means thousand.
For a long time we specified displays by their vertical dimension -- 480p, 720p, 1080p.
Then the marketing guys came along and decided that the horizontal dimension sounds bigger. If we stuck with the less-bullshitty way of doing things and kept comparisons 1:1, we'd call 3840x2160 displays 2160p or "2K" displays, but instead, the marketing people decided that we're going to change things to horizontal and called 3840x2160 "4K".
Most likely it's a MoE of Grok-0 which would be 8x33B + 50B for the router.
The active parameter count is 86B, so wouldn't that be the size of the two largest experts (which may all be the same size) plus the weights of the selector?
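Roughly, yes. Taking the public numbers at face value (314B total, 86B active, 8 experts, 2 routed per token) and assuming "active" means shared weights plus two experts, you can back out the split; this is inference from those figures, not anything stated in the released config:

    total, active, n_experts, n_active = 314e9, 86e9, 8, 2

    # (shared + 8*expert) - (shared + 2*expert) = 6*expert
    per_expert = (total - active) / (n_experts - n_active)
    shared = active - n_active * per_expert
    print(f"per expert = {per_expert/1e9:.0f}B, shared (attention/embeddings/router) = {shared/1e9:.0f}B")
    # per expert = 38B, shared = 10B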
It's actually not the largest. https://huggingface.co/google/switch-c-2048 is 1.6T parameters.