I’ve got my stuff rigged to hit mixtral-8x7, and dolphin locally, and 3.5-turbo, and the 4-series preview all with easy comparison in emacs and stuff, and in fairness the 4.5-preview is starting to show some edge on 8x7 that had been a toss-up even two weeks ago. I’m still on the mistral-medium waiting list.
Until I realized Perplexity will give you a decent amount of Mistral Medium for free through their partnership.
Who is sama kidding they’re still leading here? Mistral Medium destroys the 4.5 preview. And Perplexity wouldn’t be giving it away in any quantity if it had a cost structure like 4.5, Mistral hasn’t raised enough.
Speculation is risky but fuck it: Mistral is the new “RenTech of AI”, DPO and Alibi and sliding window and modern mixtures are well-understood so the money is in the lag between some new edge and TheBloke having it quantized for a Mac Mini or 4070 Super, and the enterprise didn’t love the weird structure, remembers how much fun it was to be over a barrel to MSFT, and can afford to dabble until it’s affordable and operable on-premise.
“Hate to see you go, love to watch you leave”.
I have 20 years of software development experience, and I couldn’t understand anything you said. Is there a dictionary for this new lingo, or am I just too mid?
He speaks very unclearly: instead of saying GPT-4-turbo he says 4.5 preview. 4.5 is an invention of his.
Also "mixtral medium": no idea what he means by that.
Not to mention the claim that Mixtral is as good as GPT-4. It's at the level of GPT-3.5 at best, which is still amazing for an open-source model, but a year behind OpenAI.
I just spent all night talking to 8x7B and can say that it sucks much less than 3.5. It doesn't screw up and apologize all the time (and then screw up again), and it doesn't repeat what I just said back to me verbatim. That's on topics I have decent expertise in myself. With 3.5 I never had this experience of periodically forgetting that I wasn't talking to a human.
How are you accessing it?
Local setup: "text generation webui" with TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF (Q4_K_M) from HF. You can run it on a decent Intel CPU; it takes around 32.5GB of RAM including the OS (8GB for the OS in my case). A GPU with tensor cores can speed up a few layers if you have one, but it isn't required. I get around 2.5-3 t/s with an 8700 and a 4070 Ti, which is enough for chats that require some thinking.
Edit: I was using a 2k window; a larger one would probably eat more RAM. But even with 2k it didn't feel like it was losing context or anything.
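If you'd rather script it than drive the webui, llama-cpp-python can load the same GGUF. A rough sketch, not my exact setup — the filename and layer count are whatever matches your download and VRAM:

```python
from llama_cpp import Llama

# Offload whatever layers fit onto the GPU; the rest run from CPU RAM.
llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # TheBloke's GGUF
    n_ctx=2048,       # the 2k window mentioned above; bigger eats more RAM
    n_gpu_layers=8,   # tune to your VRAM; 0 = pure CPU
)
out = llm("[INST] Explain sliding window attention briefly. [/INST]",
          max_tokens=256)
print(out["choices"][0]["text"])
```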
For macOS and Linux, Ollama is probably the easiest way to try Mixtral (and a large number of models) locally. LM Studio is also nice and available for Mac, Windows, and Linux.
As these models can be quite large and memory intensive, if you want to just give it a quick spin, huggingface.co/chat, chat.nbox.ai, and labs.pplx.ai all have Mixtral hosted atm.
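If you go the Ollama route, its local REST API is also trivial to script against. A minimal sketch (assumes you've already done `ollama pull mixtral` and the daemon is listening on its default port):

```python
import requests

# Non-streaming generation against the local Ollama daemon.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mixtral", "prompt": "Why is the sky blue?", "stream": False},
)
print(r.json()["response"])
```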
You can access it through Poe.com.
Sorry, but there's little that's unclear about what he said.
"mixtral medium" is just a typo: he means mistral-medium.
And GPT 4.5 is certainly not an "invention of his". Whether it exists or not (which is debatable; OpenAI said it was just mentioned in a GPT-4 hallucination and caught on), it's a version name that's been thrown around for like a month in forums, blog posts, news articles and such.
Being thrown around in forums doesn't make it more true.
But it makes it not an invention of benreesman
Which is neither here nor there, since I didn't claim that it is true.
I claimed it's not an "invention of his [benreesman]", but a term that was already out there.
Mistral-medium is a model that Mistral serves only via API, since it's a prototype model. It hasn't been released yet, and it's bigger than the mixtral-8x7b model.
I think by Mixtral Medium they mean the MoE 2x13B which is on top of the huggingface leaderboard? It is still not close to 8x175B, but size alone is not the most important factor. With smarter training methods and data it is possible we will see performance similar to GPT-4 in open-source mixtures of experts of smaller sizes.
Oh thank you, I was reading and none of that made any sense to me. I thought it could be a presentation of some dumb AI output. Now I see I’m not alone.
That gave me an idea, here is what I got from Copilot:
You have set up your system to run different AI models and compare their performance using a text editor. You are using Mixtral-8x7, a high-quality open-source model developed by Mistral AI, Dolphin, an emulator for Nintendo video games, 3.5-Turbo, a customized version of GPT-3.5, a powerful natural language model, and 4-Series Preview, a new version of the BMW sports coupe. You have noticed that the 4.5-Preview, an upcoming update of GPT-3.5, is slightly better than Mixtral-8x7, which used to be a close match. You are still waiting to access Mistral-Medium, a prototype model that is even better than Mixtral-8x7, but only available to a limited number of users.
You have discovered that Perplexity, an AI company that provides information discovery and sharing services, offers free access to Mistral-Medium through their partnership with Mistral AI. You think that Perplexity is making a mistake by giving away such a valuable model, and that they are underestimating the superiority of Mistral-Medium over the 4.5-Preview. You also think that Mistral AI is the new leader in the AI industry, and that their techniques, such as DPO (Data Processing Optimization), Alibi (a library for algorithmic accountability), sliding window (a method for analyzing time series data), and modern mixtures (a way of combining different models), are well-known and effective. You believe that the advantage of Mistral AI lies in the gap between their innovation and the ability of other developers to replicate it on cheaper and more accessible hardware. You also think that the enterprise market is not fond of the complex structure of GPT-3.5 and its variants, and that they prefer to use Mistral AI's models, which are more affordable and operable on their own premises.
You end your text with a quote from the movie Armageddon, which implies that you are leaving a situation that you dislike, but also admire.
I don't think it has anything to do with a BMW sports coupe.
What I am confused about, though, is that the parent seems to be mentioning models beyond the GPT-4 instance I currently have access to. I checked their twitter and I have seen no announcement for any 4.5 or 4-series previews. Is this just available to people using the API, or did I miss something?
Well, the Paul Ricard circuit in France has a straight called Mistral. Plenty of BMWs have been there for sure, and a zillion other cars.
I wonder if that could have confused the AI a little in combination with other hints. Turbo?
If that's a thing maybe we should start picking our names not only to make them googlable but also not to confuse LLMs at least for the next few years. Months?
OP might be referring to gpt-4-turbo.
The OpenAI API is currently advertising their preview to my clients as `gpt-4-1106-preview`. I've been calling the Q4 2023 4-series stuff `4.5` or `4.5-preview` to distinguish it from the pre-big-announcement stuff.
I don't recall if I saw any press calling anything `4.5`, but it's a different model in some important ways (one suspects better/cheaper quantization at a minimum) and since they've used `.5` for point releases in the past it seemed the most consistent with their historical versioning.
https://imgur.com/WDrqxsz
What kind of emacs distribution is on the screenshot?
It's a new theme and/or distribution that I'm working on under the working title `hyper-modern`. It clearly takes inspiration from things like Doom (and I'm using their modeline which is hard to improve on) but it's mostly ground up and AI-first.
It's heavily integrated with my custom model server and stuff and I'm slowly getting it integrated with other leading tools (vscode and nvim and stuff).
I plan to MIT it all once it's at a reasonable RC. If I get there it will be available at `https://hyper-modern.ai` and `https://github.com/hyper-modern-ai`.
Thanks for asking!
I lost it when it brought in the BMW. Thanks for the laughs.
"Dolphin, an emulator for Nintendo video games", but in this context it refers to "Dolphin, an open-source and uncensored, and commercially licensed dataset and series of instruct-tuned language models based on Microsoft's Orca paper." https://erichartford.com/dolphin
Yeah that was completely incoherent to me as well.
Same bro
My apologies to both you and grandparent, I faux-pas'd by using such sloppy jargon in a thread of general interest. If I can be useful in decoding some of that and helping to keep this topic accessible to the broader community I'd be happy to answer any questions about what I meant. I've cited a few useful resources elsewhere in the thread and am always eager to talk shop on topics like this, so fire away if you have any questions about the field or what the hell I was talking about :)
surely this guy is a bot.
or spending too much time with bots...
Now that seems more likely for better or worse :)
For a bot/spam thing, 16 years seems like a lot of planning and/or effort to put in? I don't know the typical age of people on HN these days but I can't imagine there are a lot of 2008 joins still floating around as a percentage of all commenters.
A paragraph by paragraph "dumbed down" translation of your original words would be pretty neat to have for starters. Both to understand what you mean but also to understand the lingo.
I'm hardly the best person to give a point-by-point on how modern neural networks work. The paper that kind of brought together a bunch of ideas that were floating around is 2017's "Attention Is All You Need" (and those folks are going to win a Turing almost certainly), which built on a bunch of `seq2seq` and Neural Turing Machine stuff that was in the ether before that.
Karpathy has a great YouTube series where he gets into the details from `numpy` on up, and George Hotz is live-coding the obliteration of PyTorch as the performance champion on the more implementation side as we speak.
Altman being kind of a dubious-seeming guy who pretty clearly doesn't regard the word "charity" the same way the dictionary does is more-or-less common knowledge, though not often mentioned by aspiring YC applicants for obvious reasons.
Mistral is a French AI company founded by former big hitters at e.g. DeepMind that brought the best of the best on 2023's public domain developments into one model in particular that shattered all expectations of both what was realistic with open-weights and what was possible without a Bond Villain posture. That model is "Mixtral", an 8-way mixture of experts model using a whole bag of tricks but key among them are:
- gated mixture of experts in attention models
- sliding window attention / context
- direct-preference optimization (probably the big one and probably the one OpenAI is struggling to keep up with, probably more institutionally than technically as probably a bunch of bigshots have a lot of skin in the InstructGPT/RLHF/PPO game)
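Since DPO is the headline item: the whole loss fits in a few lines once you have per-sequence log-probs. A minimal PyTorch sketch of the paper's recipe (tensor names are mine, and this is obviously not Mistral's actual training code):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratios of the trainable policy vs. a frozen reference model,
    # per (preferred, dispreferred) completion pair.
    chosen = pi_chosen_logp - ref_chosen_logp
    rejected = pi_rejected_logp - ref_rejected_logp
    # Widen the margin between preferred and dispreferred completions,
    # implicitly KL-anchored to the reference: no reward model, no PPO loop.
    return -F.logsigmoid(beta * (chosen - rejected)).mean()
```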
It's common knowledge that GPT-4 and derivatives were mixture models but no one had done it blindingly well in an open way until recently.
SaaS companies doing "AI as a service" have a big wall in front of them called "60%+ of the TAM can't upload their data to random-ass cloud providers much less one run by a guy recently fired by his own board of directors", and for big chunks of finance (SOX, PCI, bunch of stuff), medical (HIPAA, others), defense (clearance, others), insurance, you get the idea: on-premise is the play for "AI stuff".
A scrappy group of hackers too numerous to enumerate but exemplified by `ggerganov` and collaborators, `TheBloke` and his backers, George Hotz and other TinyGrad contributors, and best exemplified in the "enough money to fuck with foundation models" sense by Mistral at the moment are pulling a Torvalds and making all of this free-as-in-I-can-download-and-run-it, and this gets very little airtime all things considered because roughly no one sees a low-effort path to monetizing it in the capital-E enterprise: that involves serious work and very low shady factors, which seems an awful lot like hard work to your bog-standard SaaS hustler and offers almost no mega data-mining opportunity to the somnolent FAANG crowd. So it's kind of a fringe thing in spite of being clearly the future.
What is "Mixtral Medium"? Is it the next version of Mixtral, and can it be downloaded somewhere?
No. Mistral Medium is one of the proprietary models by Mistral AI, and it's only accessible via their API: https://docs.mistral.ai/
It currently ranks 4th in the chatbot arena leaderboard (slightly behind GPT-4's Elo rating): https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
Now you know how your girlfriend feels when she hears you speak with other software people :) Excuse my assumptions if they are incorrect. I'm making projections from my own point of view.
Romeo here flexing about how he has an actual girlfriend. Sorry, Casanova, we’re busy refining mobile-ready inference models and leetcoding while you go on your so-called “dates”
Reddit-inspired humor on HN done right, hat tip, almost impossible to do (and I've tried for 15 years with less success).
lulz.
tbh I expected serious downvotage
We developers only go on ISO 8601 dates, which somewhat limits the selection unfortunately.
Reminds me of something I watched recently - "Struggling to understand Gen Z Slang" [0].
[0]: https://www.youtube.com/watch?v=qGx4VtwMnfM
On reflection this thread is pretty clearly of general interest and my comment was more jargon than language; I hang out in ML zones too much.
For a broad introduction to the field Karpathy's YouTube series is about as good as it gets.
If you've got a pretty solid grasp of attention architectures and want a lively overview of stuff that's gone from secret to a huge deal recently I like this treatment as a light but pretty detailed podcast-type format: https://arize.com/blog/mistral-ai
It was entertaining to follow and cross check for sanity - admittedly I first read "RenTech" as a reference to https://theitcrowd.fandom.com/wiki/Reynholm_Industries
.. which seemed to fit surprisingly well.
Ironically, one of the best scenes from the series (Reynholm trying to wake his desktop by shouting "HELLO COMPUTER! HELLO!!" for hours and hours) feels very "ai"...
I'm trying to go deeper; I'd be curious to know what other ML zones you keep track of.
I know r/LocalLlama, huggingface's Daily Papers, and TheBloke. Most of what YouTube throws at me is horrific clickbait. I feel like there are probably whole communities I'm missing out on.
I appreciate it, will check it out :)
respectfully, 20 yrs of software dev experience doesn't entitle you to understand the last 2 months of AI if you didn't spend the effort to keep up. jargon happens, it's not your fault, but people also need to communicate thoughts concisely given a base of knowledge. it's ok to ask of course, but the rest of us who have been keeping up can parse this well enough (even though I disagree with some of the assertions)
I'm snarkier than most on HN and have the scars to prove it, and I do miss the RTFM-by-default vibe of the early days, but on this one topic as you can see I'm going out of my way to apologize for being cryptic and try to include everyone in the conversation because this shit matters a lot even by our standards.
Give me a hand?
i do my part :) i do lots of explaining in audio and text https://latent.space/p/dec-2023
am just here to vibe in my down time
Glad to make your acquaintance as my folks used to say. I'll check out the link and remember the handle.
They are referring to LLMs. It's not about how much software dev experience you have.
I have heard of LLMs, and understand most everything posted on HN, except quantum computing stuff.
I think in this case it's much like the difference between understanding what Pokemon are, and actively playing each game so you know the names of most Pokemon.
Just follow https://www.reddit.com/r/localllama to keep up to date on this stuff
Thanks!
"just" is carrying a loooot of weight there fyi. but given your 20 years... you should have some appreciation of that :)
We're all too mid. Luckily, these days we hoomans have AIs to help us understand other hoomans. Here are GPT-4-1106-preview and Perplexity.ai trying to shed some light on what was being said: https://pastebin.com/JuxfdrLg
Hilariously, neither knows who sama is (Sam Altman, the Drama King of OpenAI), nor do they recognize when they themselves are being discussed.
Reading the responses in full also gives you a glimpse of the specific merits and weaknesses of these systems: how up to date their knowledge and lingo are, their explaining capabilities, and their ability to see through multiple layers of referencing. It also showcases whether the AIs are willing to venture a guess and piece together some possible interpretation for hoomans to think about.
I absolutely love pointing these things at each other and watching them go.
I screen-capped my take on this to prove* that I was actually wiring all this stuff up and plug my nascent passion/oss project, but it's really funny comparing them either way: https://imgur.com/WDrqxsz
He is all over the place, mixing tech specifics with unproven models.
Basically, he said he is happy with Mixtral 8x7B and thinks it is on par with or better than OpenAI's closed-source models.
Well, the LMSys leaderboard interestingly places it just 10 Elo under the GPT-4 entries, so one would have to consider it somewhat on par.
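For scale, Elo's own logistic formula says a 10-point gap is barely better than a coin flip in head-to-head votes:

```python
def elo_win_prob(diff):
    # Standard Elo expected score for a `diff`-point rating advantage.
    return 1 / (1 + 10 ** (-diff / 400))

print(elo_win_prob(10))  # ~0.514
```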
Just had to say that the original comment, and then yours right after, is a great combo. Laughed my ass off :)
Normally I'm the mid-ass, out of touch old guy, it's a rare chance to be the one using the insider lingo :)
Half LLM, half boomer
It's a specific lingo that evolved over the last two years with the rise of LLMs. Those who have been following the development of LLMs would understand it.
Here's a glossary to understand this post:
- mixtral-8x7 or 8x7: Open source model by Mistral AI.
- Dolphin: An uncensored instruct-tuned variant of the Mistral models
- 3.5-turbo: GPT-3.5 Turbo, the cheapest API from OpenAI
- 4-series preview OR "4.5 preview": GPT-4 Turbo, the most capable API from OpenAI
- mistral-medium: A new model by Mistral AI that they are only serving through their API. It's in private beta and there's a waiting list to access it.
- Perplexity: A new search engine that is challenging Google by applying LLMs to search
- Sama: Sam Altman, CEO of OpenAI
- RenTech: Renaissance Technologies, a secretive hedge fund known for delivering impressive returns improving on the work of others
- DPO: Direct Preference Optimization. A technique that tunes a model directly on preference data, skipping RLHF's separate reward model; used to boost smaller open-source models like Zephyr-7B.
- Alibi: a Python library that provides tools for machine learning model inspection and interpretation. It can be used to explain the predictions of any black-box model, including LLMs.
- Sliding window: a type of attention mechanism used by Mistral-7B. It is used to support longer sequences in LLMs.
- Modern mixtures: Mixture-of-experts techniques that combine several expert subnetworks in one model, the way "Mixtral" mixes several Mistral-style experts.
- TheBloke: Open source developer that is very quick at quantizing all new models that come out
- Quantize: Decreasing memory requirements of a model by decreasing the precision of its weights, typically with just minor performance degradation (toy sketch after this list)
- 4070 Super: NVIDIA 4070 Super, new graphics card announced just a week ago
- MSFT: Microsoft
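Since "quantize" is the load-bearing term for running these at home, here's a toy sketch of the idea. (Real schemes like Q4_K_M quantize block-wise with extra scale/offset parameters; this is only the intuition.)

```python
import numpy as np

def quantize_q4(w):
    # Symmetric 4-bit: map floats onto integers in [-8, 7] with one scale.
    scale = np.abs(w).max() / 7
    return np.clip(np.round(w / scale), -8, 7).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(8).astype(np.float32)
q, s = quantize_q4(w)
print(w)
print(dequantize(q, s))  # close but not exact -- the "minor degradation"
```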
I asked ChatGPT to rewrite the original post using your glossary, which worked well:
I've set up my system to use several AI models: the open-source Mixtral-8x7, Dolphin (an uncensored version of Mixtral), GPT-3.5 Turbo (a cost-effective option from OpenAI), and the latest GPT-4 Turbo from OpenAI. I can easily compare their performances in Emacs. Lately, I've noticed that GPT-4 Turbo is starting to outperform Mixtral-8x7, which wasn't the case until recently. However, I'm still waiting for access to Mistral-Medium, a new, more exclusive AI model by Mistral AI.
I just found out that Perplexity, a new search engine competing with Google, is offering free access to Mistral Medium through their partnership. This makes me question Sam Altman, the CEO of OpenAI, and his claims about their technology. Mistral Medium seems superior to GPT-4 Turbo, and if it were expensive to run, Perplexity wouldn't be giving it away.
I'm guessing that Mistral AI could become the next Renaissance Technologies (a hedge fund known for its innovative strategies) of the AI world. Techniques like Direct Preference Optimization, which improves smaller models, along with other advancements like the Alibi Python library for understanding AI models, sliding windows for longer text sequences, and combining multiple models, are now well understood. The real opportunity lies in quickly adapting these new technologies before they become mainstream and affordable.
Big companies are cautious about adopting these new structures, remembering their dependence on Microsoft in the past. They're willing to experiment with AI until it becomes both affordable and easy to use in-house.
It's sad to see the old technology go, but exciting to see the new advancements take its place.
The GP did a great job summarizing the original post and defining a lot of cryptic jargon that I didn't anticipate would generate so much conversation, and I'd wager did it without a blind LLM shot (though these days even that is possible). I endorse that summary without reservation.
And the above is substantially what I said, and undoubtedly would find a better reception with a larger audience.
I'm troubled though, because I already sanitize what I write and say by passing it through a GPT-style "alignment" filter in almost every interaction precisely because I know my authentic self is brash/abrasive/neuro-atypical/etc. and it's more advantageous to talk like ChatGPT than to talk like Ben. Hacker News is one of a few places real or digital where I just talk like Ben.
Maybe I'm an outlier in how different I am and it'll just be me that is sad to start talking like GPT, and maybe the net change in society will just be a little drift towards brighter and more diplomatic.
But either way it's kind of a drag: either passing me and people like me through a filter is net positive, which would suck but I guess I'd get on board, or it actually edits out contrarian originality in toto, in which case the world goes all Huxley really fast.
Door #3 where we net people out on accomplishment and optics with a strong tilt towards accomplishment doesn't seem to be on the menu.
I would have said there is no problem with your style (nothing brash/abrasive), but you used a lot of jargon that people who are not very deep into LLMs (large language models) would not understand. The interests of hackernews visitors are very diverse; not everyone follows LLMs that closely.
This was my take exactly. I read the original and thought, "Wow, this sounds like really interesting stuff this poster is excited about. I wish I knew what the terms meant, though. I'll have to come back to this when I have more time and look up the terms."
I was pleasantly surprised to find a glossary immediately following, which tells me it wasn't the tone of the post, but the shorthand terminology that was unfamiliar to me that was my issue.
I think writing in "Ben's voice" is great. There are just going to be times when your audience needs a bit more context around your terminology, that's all.
People aren’t passing you through a filter because you are brash and undiplomatic and “unaligned”, it’s because your communication style is borderline incomprehensible.
FWIW I found it comprehensible just jargon-laden
¿Por qué no los dos? While I truly appreciate your OP and could grok it even though I don't know the tech, the summary and rewrites saved me a ton of googling. I hope one day we have a 'see vernacular/original' button for all thought and communication so people can choose what level to engage at without the author having to change their communication style. Translation for personal dialects, so to say.
Personally I had no problem with your post being too abrasive or anything like that, it was just hard to understand because it was full of jargon.
I think the only thing you really need to do is unpack your jargon so people who aren't exactly you can understand what you're saying. Even on this site, there are folks with all sorts of different experiences and cultural context, so shortcuts in phrasing don't always come across clearly.
For example, "in which case the world goes all Huxley really fast." "Huxley" apparently means something to you. Would it mean anything at all to someone who hasn't read any Aldous Huxley? As someone who _has_, I still had to think about it -- a lot. I assumed you're referring to a work of his literature rather than something he actually believed, as Huxley's beliefs about the world certainly had a place for the contrarian and the original.
Further, I assume you are referring to his most well-known work, _Brave New World_, rather than (for example) _Island_, so you're not saying that people would be eating a lot of psychedelic mushrooms and living together in tolerant peace and love.
I don't at all think you need to sound like GPT to be a successful communicator, but you will be more successful the more you consider your audience and avoid constructions that they're unlikely to be able to understand without research.
Real Ben >> GPT Ben. However, if you are going out to the wider world you probably need to self-varnish a lot (I know I would have to). You are fine in here!
What you are alluding to is quite similar to that "instagram face" that everyone pursues and self-filters for, except it's about your communication and thoughts instead. Also, I don't think the argument that you need to reach a wider audience holds unless you want the wider audience to comment and engage.
The internet is the great homogenizer; soon(ish) we will be uniform.
I used to struggle a lot with communication, talking to people in the authentic-self way you just described. Being too direct and stating my point of view that way has caused tension with family, colleagues, and the girlfriend.
The moment I change the way I talk and say "That could be a challenging and rewarding experience" instead of "That's bullshit, let's move away from it", I can already see the advantage.
I'd rather talk the way I want, but I see that as challenging and not that rewarding, as people seem to get more sensitive. It makes me wonder whether the way GPT-style chatbots communicate with humans will make humans expect the same style of communication from other humans.
Did you just paste that into an LLM and ask it to create a glossary? :-P
(but seriously: Thanks !)
Emil didn't, but I did (and yeah, it's useless):
Mixtral-8x7: This appears to be a technical term, possibly referring to a software, framework, or technology. Its exact nature is unclear without additional context.
Dolphin locally: "Dolphin" could refer to a software tool or framework. The term "locally" implies it is being run on a local machine or server rather than a remote or cloud-based environment.
3.5-turbo: This could be a version name or a type of technology. "Turbo" often implies enhanced or accelerated performance.
4-series preview: Likely a version or iteration of a software or technology that is still in a preview or beta stage, indicating it's not the final release.
Emacs: A popular text editor used often by programmers and developers. Known for its extensibility and customization.
Mistral Medium: This might be a product or service, possibly in the realm of technology or AI. The specific nature is not clear from the text alone.
Perplexity: Likely a company or service provider, possibly in the field of AI or technology. They seem to have a partnership offering involving Mistral Medium.
RenTech of AI: RenTech, or Renaissance Technologies, is a well-known quantitative hedge fund. The term here is used metaphorically to suggest a pioneering or leading position in the AI field.
DPO, Alibi, and sliding window: These are likely technical concepts or tools in the field being discussed. Without additional context, their exact meanings are unclear.
Modern mixtures: This could refer to modern algorithms, techniques, or technologies in the field of AI or data science.
TheBloke: This could be a reference to an individual, a role within a community, or a specific entity known for certain expertise or actions.
4070 Super: This seems like a model name, possibly of a computer hardware component like a GPU (Graphics Processing Unit).
MSFT: An abbreviation for Microsoft Corporation.
On-premise: Refers to software or services that are operated from the physical premises of the organization, as opposed to being hosted on the cloud.
This is actually hilarious. It reads like a student who did not study for the exam but still tries their best to scratch a point or two by filling the page with as many reasonable-sounding statements (a.k.a. "bullshit") as they can. Not that I expect more of a language model, no matter how "large".
This makes it actually more human (as in, this is how a real student would have reacted) and therefore more AGI-like than I would have anticipated.
GPT4 is able to look terms up on the Internet if you ask, and will give you a list of specs on it, with a cite so you know it's not hallucinating them.
"4070 Super": https://chat.openai.com/share/0aac7d90-de65-41d0-9567-8e56a0...
"Mixtral-8x7": https://chat.openai.com/share/8091ac61-d602-414c-bdce-41b49e...
That's a weirdly dismissive statement. The fundamental problem is that a lot of these terms are from after the AI's cutoff point. It's perfectly able to handle terms like "Emacs", "RenTech" or "MSFT", and it can guess that "4070 Super" probably refers to a GPU.
ChatGPT in a few years will probably be perfectly able to produce the correct answers.
(Actually, ChatGPT consistently claims its current cutoff is April 2023, which should let it give a better answer, so I'm taking a few points off my explanation. But it still feels like the most probable one.)
I think you've done a great explanation expansion, except I believe it's ALiBi ("Attention with Linear Biases Enables Input Length Extrapolation")[1], a method of positional encoding (i.e. telling the Transformer model how much to weight a distant token when computing the current output token). This has been used on various other LLMs[2].
[1]: https://arxiv.org/abs/2108.12409
[2]: n.b. Ofir Press is co-creator of ALiBi https://twitter.com/OfirPress/status/1654538361447522305
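The whole mechanism is small enough to sketch. A minimal illustration of the bias tensor, with per-head slopes following the paper's geometric schedule (an illustration, not any particular model's implementation):

```python
import torch

def alibi_bias(n_heads, seq_len):
    # Fixed (not learned) per-head slopes, geometric as in Press et al.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads)
                           for h in range(n_heads)])
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]   # (q, k); <= 0 for past positions
    # Scores get penalized linearly with distance; add to the attention
    # logits before softmax (together with the usual causal mask).
    return slopes[:, None, None] * rel[None, :, :]   # (heads, q, k)
```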
This is indeed what I was referring to, and along with RoPE and related techniques it's a sort of "meta-attention" in which a cost-effective scalar pointwise calculation can hint the heavyweight attention mechanism, with super-linear returns in practical use cases.
In more intuitive terms, your bog-standard transformer overdoes it in terms of considering all context equally in the final prediction, and we historically used rather blunt-force instruments like causally masking everything to zero.
These techniques are still heuristic and I imagine every serious shop has tweaks and tricks that go with their particular training setup, but the RoPE stuff in general is kind of a happy medium and exploits locality at a much cheaper place in the overall computation.
My understanding is that Mistral uses a regular 4K RoPE that is "extended" to a larger window size with SWA. This is based on looking at the results of Nous Research's Yarn-Mistral extension: https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k and Self-Extend, both of which only apply to RoPE models.
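For anyone following along at home, RoPE itself is only a few lines, and several of the extension tricks below amount to rescaling its angles or base. A simplified "rotate-half" sketch (illustrative only):

```python
import torch

def rope(x, base=10000.0):
    # x: (seq, dim). Rotate channel pairs by position-dependent angles so
    # that query-key dot products depend only on relative position.
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half) * 2.0 / dim)   # (half,)
    ang = torch.arange(seq)[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # NTK/YaRN-style context extension mostly rescales `base` or `ang`.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```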
There are quite a few attention extension techniques recently published:
* Activation Beacons - up to 100X context length extension in as little as 72 A800 hours https://huggingface.co/papers/2401.03462
* Self-Extend - a no-training RoPE modification that can give "free" context extension with 100% passkey retrieval (works w/ SWA as well) https://huggingface.co/papers/2401.01325
* DistAttention/DistKV-LLM - KV cache segmentation for 2-19X context length at runtime https://huggingface.co/papers/2401.02669
* YaRN - aforementioned efficient RoPE extension https://huggingface.co/papers/2309.00071
You could imagine combining a few of these together to basically "solve" the context issue while largely training for shorter context length.
There are of course some exciting new alternative architectures, notably Mamba https://huggingface.co/papers/2312.00752 and Megabyte https://huggingface.co/papers/2305.07185 that can efficiently process up to 1M tokens...
imo mistral-medium is worse than mixtral. Do you have API access?
As someone who follows AI pretty closely, this was unbelievably helpful in understanding the parent post. It's crazy how much there is to keep on top of if you don't want to fall behind everything that is going on in AI at the moment.
I love you, Emil
Crazy, your post feels like downloading martial arts in the Matrix. I read the parent, didn't get a thing and though the guy was on substances. Read yours. Read the parent again. I speak AI now! I'm going to use this new power to raise billions!
Thanks for this. I was initially wondering what this new GPT 4.5 model was and if I had somehow missed out on something big.
I'm clearly spending far too much time tuning/training/using these things if a glossary to make my post comprehensible to HN is longer than my remark: thank you for correcting my error in dragging this sub-sub-sub-field into a thread of general interest.
That's an impressive list of jargon whaha
Love how deep the rabbithole has gone in just a year. I am unfortunately in the camp of understanding the post without needing a glossary. I should go outside more :|
On what metrics? LMSys shows it does well but 4-Turbo is still leading the field by a wide margin.
I am using 8x-7b internally for a lot of things and Mistral-7b fine-tunes for other specific applications. They're both excellent. But neither can touch GPT-4-turbo (preview) for wide-ranging needs or the strongest reasoning requirements.
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
EDIT: Neither does mistral-medium, which I didn't discuss, but is in the leaderboard link.
Keep in mind that modern quantitative approaches to LLM evaluation have been effectively co-designed with the rise of OpenAI, and folks like Ravenwolf routinely disagree with the leaderboards.
There's also very little if any credible literature on what constitutes statistically significant on MMLU or whatever. There's such a massive vested interest from so many parties (the YC ecosystem is invested in Sam, MSFT is invested in OpenAI, the US is invested in not-France, a bunch of academics are invested in GPT-is-borderline-AGI, Yud is either a Time Magazine cover author or a Harry Potter fanfic guy, etc.) in seeing GPT-4.5 at the top of those rankings and taking the bold one at < 10% lift as state of the art that I think everyone should just use a bunch of them and optimize per use case.
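To put a number on the significance point: treating MMLU as n i.i.d. questions (it isn't, and the test-set size here is my assumption), a rough binomial interval suggests sub-point gaps are within noise:

```python
import math

n, p = 14042, 0.85  # assumed MMLU test-set size, observed accuracy
se = math.sqrt(p * (1 - p) / n)
print(f"95% CI: +/- {1.96 * se * 100:.2f} accuracy points")  # ~ +/- 0.59
```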
I have my own biases as well and freely admit that I love to see OpenAI stumble (no I didn't apply to work there, yes I know knuckleheads who go on about the fact they do).
And once you factor in "mixtral is aligned to the demands of the user and GPT balks at using profanity while happily taking sides on things Ilya has double-spoken on", even e.g. MMLU is nowhere near the whole picture.
It's easy and cheap to just try both these days, don't take my word for which one is better.
I literally use 8x-7b on my on-prem GPU cluster and have several fine tunes of 7b (which I said in the previous post). I've used mistral-medium.
GPT-4-turbo is better than them all on all benchmarks, human preference, and anything that isn't biased vibes. My opinion - such that it is - is that GPT-4-turbo is by far the best.
I have no vested interest in it being the best. I'd actually prefer if it wasn't. But all objective data points to it being the best and most lived experiences that are unbiased agree (assuming broad model use and not hyperfocused fine-tunes; I have Mistral-7b fine-tunes beating 4-turbo in very limited domains, but that hardly counts).
The rest of your post I really have no idea what's going on, so good luck with all that I guess.
Mistral Medium beats 4.5 on the censorship benchmark. It doesn't refuse to help with anything that could be vaguely non-PC or could potentially be used to hurt anyone in the wrong hands, including dangerously hot salsa recipes.
That's not a metric.
That's a use case.
Certainly, no one here is disputing that there are things OpenAI refuses to allow, and given that the effectiveness of using GPT-4 on those is literally zero, a sweet potato connected to a spring and a keyboard will "beat" GPT-4, if that's your scoring metric.
If you want a meaningful comparison you need tasks that both tools are capable of doing, and then see how effective they are.
Claiming that Mistral Medium beats it is like me claiming that RenderMan beats DALL-E 2 at rendering 3D models; yes, technically they both generate images, but since it's not possible to use DALL-E 2 to render a 3D model, it's not really a meaningful comparison, is it?
The fact that it's incapable of simple requests that an alternative can handle is absolutely part of a worthwhile comparison.
Semantics.
Both tools are generative systems that produce text in response to a prompt. If Mistral were mute on random topics for no other reason than that its makers dislike talking about them, would you say that doesn't count?
I'm a big proponent of freedom in this space (and remain one), but Dolphin is fucking scary.
I don't have any use cases for crime in my life at the moment beyond wanting to pirate, like, Adobe Illustrator before signing up for an uncancelable subscription, but it will do arbitrary things within its abilities, and it's Google with a grudge in terms of how to do anything you ask. I stopped wanting to know when it convinced me it could explain how to stage a coup d'etat. I'm back on mixtral-8x7b.
Agree with this. I would say that the rate of progress from Mistral is very encouraging though in terms of having multiple plausible contenders for the crown.
Sorry but you're talking complete nonsense here. The benchmark by LMSys (chatbot arena) cannot be gamed, and Ravenwolf is a random-ass poster with no scientific rigor to his benchmarks.
Cannot be gamed? C'mon now... You could pay a bunch of people to vote for your model in the arena.
No you can't, because you actually don't know which model is which when you vote.
Do only the initial votes count? Because after I made an initial choice I was then put in a session where I saw the names of both of the AIs. I made subsequent votes in that session where I could see their names.
It just feels like "which LLM is better" is becoming the new "which GPU is better" type of talk. It's great to find a clear winner, but in the end the gap between the leaders isn't an order of magnitude.
These days the question is more about which LLM is second best. It’s very tight while ChatGPT 4 is in its own league.
I think people are missing the context that the prices of even the largest LLMs trend towards $0 in the medium term. Mistral-medium is almost open source, and it's still early days.
Curious that you mentioned "4.5-preview". What do you mean there?
To my knowledge, and I searched to confirm, GPT-4.5 is not yet released. There were some rumors and a link to ChatGPT's answer about GPT-4.5 (could also be a hallucination) but Sam tweeted it was not true.
They literally made it up.
That seems a little harsh. There was clearly what amounted to an internal point release in Q4, there was a big announcement and the historical convention on versioning is `.5` increments.
It's "unofficial" but "literally made it up" seems a bit unfair, it's not like I called it `GPT-4-Ti Founders Edition` and tried to list it on eBay.
They meant GPT-4 Turbo, which is an improvement over GPT-4.
Can you share some examples of how you are using it? Mixtral that is? What's your setup? What's your flow/workflow?
I screenshotted my emacs session upthread in a bit of a cheeky "AI-talking-about-AI" joke: https://imgur.com/WDrqxsz.
While I heavily rely on `emacs` as my primary interface to all this stuff, I'm slowly-but-surely working on a curated and opinionated collection of bindings and tools and themes and shit for all the major hacker tools (VSCode, `nvim`, even to a degree the JetBrains ecosystem). This is all broadly part of a project I'm calling `hyper-modern` which will be MIT if I get to a release candidate at all.
I have a `gRPC` service that wraps the outstanding work by the "`ggerganov` crew", loosely patterned on the sharded model-server architectures we used at FB/IG and mercilessly exploiting the really generous free plan offered by the `buf.build` people (seriously, check out the `buf.build` people) in an effort to give hackers the best tools in a truly modern workflow.
It's also an opportunity to surface some of the outstanding models that seem to have sunk without a trace (top of mind would be Segment Anything out of Meta and StyleTTS which obsoletes a bunch of well-funded companies) in a curated collection of hacker-oriented capabilities that aren't clumsy bullshit like co-pilot.
Right now it's a name and a few thousand lines of code too rough to publish, but if I get it to a credible state the domain is `https://hyper-modern.ai` and the code will be MIT at `https://github.com/hyper-modern-ai/`.
Feel free to push the code. Elisp should be hacky and embarrassing otherwise you are doing it wrong :)
Speculative musings beckon, and we dare to embrace them. The crux of the matter appears to be the chasm that separates novel advancements from the moment they are quantified for mainstream consumption. Retaining vivid memories of past entanglements with industry titans, circumspectly explore and exploit these innovations until they become both affordable and practicable for on-premise utilization, finally unveiling competitive prowess. The overarching question looms large. Perhaps, Mistral has not yet amassed the financial resources commensurate with such largesse.
"My hips don't lie."
https://gist.github.com/b7r6/fde6fb3be9a752a989054e62905307f...
Not bad! It's having trouble with syllables tho.
Was this generated by some AI? Is it a parody?
I've made similar apologies upthread but I'm passionate about this being an inclusive conversation and so I'm trying to respond to everyone who I confused with all the jargon.
The trouble with the jargon is that it obfuscates to a high degree even by the standards of the software space, and in a field where the impact on people's daily lives is at the high end of the range, even by the standards of the software space.
HN routinely front-pages stuff where the math and CS involved is much less accessible, but for understandable reasons a somewhat tone-deaf comment like mine is disproportionately disruptive: people know this stuff matters to them either now or soon, and it's moving as quickly as anything does, and it's graduate-level material.
If you have concrete questions about what probably looks like word salad I'll do my best to clarify (without the aid of an LLM).
Thanks for the insights. What is your typical Emacs workflow for using and comparing the models?
I'm running custom stuff that I plan/hope to MIT soon, but `gptel` is killer and I've substantially plagiarized it feature-wise in my own dots. (I don't intend to release anything under a more permissive license than it was published under; merely that it sets the bar on a good interface and I plan to launch nothing less capable.)
Do you have a source on Mistral/Mixtral using that?
No, they could be using any of the variants of pointwise scalar trig-style embedding; one imagines it's at least a little custom to their particular training setup.
It was just an example of a modern positional encoding. I regret that I implied inside knowledge about that level of detail. They're doing something clever on scalar pointwise positional encoding but as for what who knows.
Any chance you could post some comparisons between Mistral medium and gpt-4 turbo? I'm curious where you think it's more impressive, I hadn't spent the time to evaluate it yet.
Go to the Arena (side-by-side) tab on LMsys and you can try it yourself!
https://chat.lmsys.org/
It's a great tool they make available.
I'm still waiting for the AI encabulator.
Had a good laugh about your comment, then realized that this is _exactly_ what AI would be really good at...
Basically let an AI hallucinate on some technical subject. It would make a great script for a new encabulator video.
I'm curious about your workflow including all of these: is it only out of curiosity? Do you switch between them for specific tasks, or even run them in parallel for some purpose?
Also, is anyone aware of a service that supplies API endpoints for dolphin? I'd love to experiment with it, but running locally exceeds my budget.
What a nonsensical statement.
I understand some of these words.
In all seriousness, are self hosted GPT alternatives really viable?
Care to share what you are using it for?
I'm curious because I'm gathering some use cases, so that I can share them internally at the company to provide better education on what LLMs do and how they work.
Not sure what all the fuss is about regarding the incomprehensibility of this. It's a densely packed comment, information-wise, and it expects familiarity with the field, but there's nothing really that obscure about it.
I might not know half of the references like "sama" or "TheBloke", but I could understand the context of them all. Like:
"the lag between some new edge and TheBloke having it quantized for a Mac Mini or 4070 Super,"
Not sure who TheBloke is, but he obviously means "between some new (cutting) edge AI model, and some person scaling it to run on smaller computers with less memory".
Similarly, not sure who Perplexity is, but "Until I realized Perplexity will give you a decent amount of Mistral Medium for free through their partnership" basically spells out that they're a service provider of some kind, that they have partnered with Mistral AI, and you get to use the Mistral Medium model through opening a free account on Perplexity.
I mean, duh!
Dolphin-mixtral is incredible for the size that it is. But I'm curious, have you tried Goliath-120b or the new `Mixtral_34Bx2_MoE_60B` (it's named Mixtral but the base is actually Yi)?
Goliath is too big for my system but Mixtral_34Bx2_MoE_60B[1] is giving me some really good results.
PSA to anyone that does not understand what we're talking about: I was new to all of this until two weeks ago as well. If you want to get up to speed with the incredible innovation and home-tinkering happening with LLMs, you have to check out https://www.reddit.com/r/LocalLLaMA/
I believe we should be at GPT-4 levels of intelligence locally sometime later this year (possibly with the release of Llama 3 or a Mistral Medium open model).
[1] - https://huggingface.co/TheBloke/Mixtral_34Bx2_MoE_60B-GGUF