
Are Open-Source Large Language Models Catching Up?

lhl
61 replies
2d11h

A couple of big/strong open models have been released in the past few days:

* Qwen 72B (and 1.8B) - 32K context, trained on 3T tokens, <100M MAU commercial license, strong benchmark performance: https://twitter.com/huybery/status/1730127387109781932

* DeepSeek LLM 67B - 4K context, 2T tokens, Apache 2.0 license, strong on code (although DeepSeek Coder 33B benches better) https://twitter.com/deepseek_ai/status/1729881611234431456

Also recently released: Yi 34B (with a 100B rumored soon), XVERSE-65B, Aquila2-70B, and Yuan 2.0-102B, interestingly, all coming out of China.

Personally, I'm also looking forward to the larger Mistral releasing soon, as mistral-7b-v0.1 was already incredibly strong for its size.

John9
38 replies
2d11h

Since ChatGPT isn't allowed in China, there is a huge opportunity to build a local LLM.

wenyuanyu
24 replies
2d10h

Anyone know why both OpenAI and Anthropic proactively banned users from China from using their products?

wavemode
21 replies
2d10h

Source for this? I know China's government firewall blocks ChatGPT (for obvious reasons) but I wasn't aware that OpenAI was blocking them in return.

pototo666
12 replies
2d10h

Chinese IPs are not allowed to use ChatGPT. Chinese credit cards are not allowed for the OpenAI API.

Source: my own experience.

What puzzles me most is the second restriction. My credit card is accepted by AWS, Google, and many other services. It is also accepted by many services which use Stripe to process payments.

But OpenAI refuses to take my money.

GaggiX
6 replies
2d9h

I don't understand: if ChatGPT is blocked by the firewall, how do you know that ChatGPT is blocking IPs in return? Are there Chinese IP ranges that are not affected by censorship that a citizen can use?

dantondwa
4 replies
2d9h

When a website is blocked by the firewall, it doesn’t load.

When a website blocks Chinese users, the website loads but you cannot create an account.

Yes, the firewall does not block everything, otherwise it would be the same as turning off the internet! There are websites that work.

GaggiX
3 replies
2d9h

Okay but the point is that ChatGPT is blocked by the firewall.

EDIT: I read the comment below about Hong Kong, but I can't reply because I'm typing too fast by HN standards, so I'm writing it here and yolo: "I'm from Italy and I remember when ChatGPT was blocked here after the Garante della Privacy complaint. Of course the site wasn't blocked by Italy, but OpenAI complies with local obligations, so maybe that could be a reason for the block. The API was also not blocked in Italy."

EDIT 2: if the website is not actually blocked (the websites that check whether a site is reachable from mainland China lied to me) then I guess they are just complying with local regulations so that the entire website does not get blocked.

hnfong
1 replies
2d9h

Insofar as Hong Kong IPs are "Chinese IPs", we can access OpenAI's website, but their signup and login pages block Hong Kong phone numbers, credit cards, and IP addresses.

Curiously, the OpenAI API endpoints work flawlessly with Hong Kong IP addresses as long as you have a working API key.

btzs
0 replies
2d6h

OpenAI API is not blocked. You can set up your own front-end like Chatbot UI.

deadfoxygrandpa
0 replies
2d8h

it's not blocked by the firewall. i'm in china and i can load openai's website and chatgpt just fine. openai just blocks me from accessing chatgpt or signing up for an account unless i use a VPN and US based phone number for signup

as in, if i open chat.openai.com in my browser without a VPN, from behind the firewall, i get an openai error message that says "Unable to load site" with the openai logo on screen

if the firewall blocks something the page just doesn't load at all and the connection times out

rfoo
0 replies
2d7h

ChatGPT was not blocked by the GFW for the first few weeks after it was released (if not months, I don't remember), but at that time OpenAI already blocked China.

The geo check only happened once during login at that time, with a very clear message that it's "not available in your region". Once you are logged in with a proxy you can turn off your proxy/VPN/whatever and use ChatGPT just fine.

rmbyrro
1 replies
2d5h

Have you tried with a prepaid card? Some even allow you to fund it with crypto.

renonce
0 replies
2d4h

Yeah that’s how most users in China access OpenAI. But it’s inconvenient for the majority of people nevertheless.

dspillett
1 replies
2d2h

> Chinese credit card is not allowed for OpenAI API.

A lot of online services don't accept Chinese credit cards, hosting providers for instance, so I don't think that is specific to OpenAI. The reason usually given for this is excessive chargebacks and (in the case of hosting) TOS violations like sending junk mail (followed by a chargeback when this is blocked). It sounds a little like collective punishment: while I don't doubt that there are a lot of problem users coming from China, with such a large population that doesn't mean that a majority of users from the region are a problem. I can see the commercial PoV though: if the majority of chargeback issues and related problems come from a particular region and you get very few genuine customers from there¹ then blocking the area is a net gain despite potentially losing customers.

----

[1] due to preferring local variants (wanting to support local services, local resources having lower latency, your service being blocked by something like the GFW, local services being in their language, any/all of the above and more)

hnfong
0 replies
1d14h

It's definitely not a commercial thing but political.

I'm located in Hong Kong and using Hong Kong credit cards has never been a problem with online merchants. I don't think Hong Kong credit cards are particularly bad with chargebacks or whatever. OpenAI has explicitly blocked Hong Kong (and China). Hong Kong and China, together with other "US adversaries" like Iran, N. Korea, etc., are not on OpenAI's supported countries list.

If you have been paying attention, you'll know that US policy makers are worried that Chinese access to AI technology will pose a security risk to the US. This is just one instance of these AI technology restrictions. Ineffectual of course given the many ways to work around them, but it is what it is.

SheinhardtWigCo
0 replies
2d

Perhaps they are unwilling to operate in a territory where they would be required to disclose every user's chat history to the government, which has potentially severe implications for certain groups of users and also for OpenAI's competitive interests.

skripp
3 replies
2d8h

I live in China. You can't use it here easily. Even if you use a VPN you still need a non-Chinese phone number.

finnjohnsen2
2 replies
2d8h

I would love to know how they can technically stop you once you run a VPN. Do you have any idea on that?

hnfong
0 replies
1d14h

OpenAI requires a working phone number to sign up, and a credit card to use various features.

So they just block the phone numbers (which have a country code) and credit cards (whose owner/issuer country info is available).

Not sure why this seems to be such a surprise to everyone here...

deadfoxygrandpa
0 replies
2d8h

i know how: you need a verified phone number to open an account, and open ai does not accept chinese phone numbers or known IP phone numbers like google voice.

they also block a lot of data center IP addresses, so if you're trying to access chatgpt from a VPN running on blacklisted datacenter IP range (a lot of VPN services or common cloud providers that people use to set up their own private VPNs are blacklisted), then it tells you it can't access the site and "If you are using a VPN, try turning it off."

hnfong
3 replies
2d9h

OpenAI does not allow users from China, including Hong Kong.

Hong Kong generally does not have a Great Firewall, so the only thing preventing Hong Kong users from using ChatGPT is Open AI's policy. They don't allow registration from Hong Kong phone numbers, from Hong Kong credit cards, etc.

I'd say it's been pretty deliberate.

Reason? Presumably in alignment with US government policies of trying to slow down China's development in AI, alongside the chip bans etc.

suslik
0 replies
2d8h

Sounds plausible - this is in line with the modern trend to posture by sanctioning innocent people.

Of course, the only demographic these restrictions can affect are casuals. Even I know how to circumvent this; thinking that this could hinder a government agent - who surely has access to all the necessary infrastructure by default - is simply mental.

laborcontract
0 replies
2d6h

A now-former board member was a policy hawk. One of their big beliefs is that China is at no risk of keeping up with US companies, due to them not having the data.

I wouldn't be surprised if OpenAI blocking China is a result of them trying to prevent Chinese labs from generating synthetic training sets.

btzs
0 replies
2d6h

My theory was that they operate at a loss and they don't want to increase that loss by offering it to adversaries.

jacquesm
0 replies
2d2h

Probably the realization that this is an arms race of sorts.

FooBarWidget
0 replies
2d6h

Probably because of the cost of legal compliance. Various AI providers also blocked Europe until they were ready for GDPR compliance. China has even stricter rules w.r.t. privacy and data control: a lot of data must stay inside China while allowing authorities access. Typically, implementing this properly requires either a local physical presence or a local partner. This is why many apps/services have a completely segregated China offering. AWS's China region is completely sealed off from the rest of AWS, and is offered through a local partner. Similar story with Azure's China region.

magpi3
9 replies
2d9h

Baidu has a Chatgpt clone that I use regularly.

https://yiyan.baidu.com

I imagine it is good enough for most people.

moffkalast
4 replies
2d7h

Given the subdomain name, I presume it uses the Yi-34B model?

magpi3
2 replies
2d5h

I have no idea, but yiyan is short for wenxinyiyan(文心一言), which roughly translates to character-heart-one-(speech/word). Maybe someone who is Chinese could translate it better. So I don't think the name has anything to do with the model.

I do wonder what their backend is. They have the same 3.5/4 version numbering scheme that ChatGPT uses, which could be just marketing (and probably is), but I wonder.

EDIT: fixed my translation

nanmu42
0 replies
2d3h

Their backend originates from Baidu ERNIE: http://research.baidu.com/Blog/index-view?id=160

antonvs
0 replies
2d3h

“A single word from the heart”

yzh
0 replies
2d2h

AFAIK, model behind yiyan is Baidu's ERNIE. Yi-34B (and Yi model family) comes from another startup created by Kai-fu Lee earlier this year: 01.ai.

szatkus
1 replies
2d

Assuming I get through registration, can I talk with it in English?

magpi3
0 replies
1d9h

Yes.

GaggiX
1 replies
2d9h

I'm curious why you've opted for this model over ChatGPT-3.5. Is it because it performs better in Chinese?

magpi3
0 replies
2d5h

ChatGPT is blocked in China, including Hong Kong, so my school computer doesn't have access to it. I'm also a very, very casual AI user.

blueboo
2 replies
2d7h

Which is why there are 100+ LLMs in China ... the so-called 百模大战, battle of the 100 models.

eunos
1 replies
2d3h

Let 100+ LLMs bloom

civilitty
0 replies
2d2h

I was thinking more of a Thunderdome setup.

purplecats
13 replies
2d10h

when is the new mistral coming out and at what size?

tarruda
12 replies
2d9h

I'm hoping that they make it 13B, which is the size I can run locally in 4-bit and still get reasonable performance

kybernetikos
10 replies
2d7h

What kind of system do you need for that?

michaelt
4 replies
2d6h

If your GPU has ~16GB of vram, you can run a 13B model in "Q4_K_M.gguf" format and it'll be fast. Maybe even ~12GB.

It's also possible to run on CPU from system RAM, to split the workload across GPU and CPU, or even from a memory-mapped file on disk. Some people have posted benchmarks online [1] and naturally, the faster your RAM and CPU the better.

My personal experience is running from CPU/system ram is painfully slow. But that's partly because I only experimented with models that were too big to fit on my GPU, so part of the slowness is due to their large size.

[1] https://www.reddit.com/r/LocalLLaMA/comments/14ilo0t/extensi...
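
For anyone who wants to try this, a minimal sketch using the llama-cpp-python bindings (the model path and layer count are just examples; you'd tune n_gpu_layers to whatever fits in your VRAM):

  # Hypothetical local Q4_K_M GGUF file; requires llama-cpp-python built with GPU support.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./llama-2-13b-chat.Q4_K_M.gguf",  # example quantized model file
      n_ctx=4096,        # context window
      n_gpu_layers=40,   # offload most layers to the GPU; lower this on ~8GB cards
  )

  out = llm("Q: Name the planets in the solar system. A:", max_tokens=128)
  print(out["choices"][0]["text"])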

tarruda
2 replies
2d6h

I get 10 tokens/second on a 4-bit 13B model with 8GB VRAM offloading as much as possible to the GPU. At this speed, I cannot read the LLM output as fast as it generates, so I consider it to be sufficient.

johnbellone
1 replies
2d1h

Which video card?

tarruda
0 replies
2d1h

RTX 3070 Max-Q (laptop)

gardnr
0 replies
2d5h

I can fit 13B Q4_K_M models on a 12GB RTX 3060. It OOMs when the context window goes above 3k. I get 25 tok/s.

tarruda
2 replies
2d7h

Mine is a laptop with an i7-11800H CPU + RTX 3070 Max-Q 8GB VRAM + 64GB RAM (though you can probably get away with 16GB RAM). I bought this system for work and casual gaming, and was happy when I found out that the GPU also enabled me to run LLMs locally at good performance. This laptop cost me ~$1600, which was a bargain considering how much value I get out of it. If you are not on a budget, I highly recommend getting one of the high end laptops that have an RTX 4090 and 16GB VRAM.

With my system, Llama.cpp can run Mistral 7B 8-bit quantized by offloading 32 layers to the GPU (35 total) at about 25-30 tokens/second, or 6-bit quantized by offloading all layers to the GPU at ~ 35 tokens/second.

I've tested a few 13B 4-bit models such as CodeLlama by offloading 37 layers to the GPU, which got me about 10-15 tokens/second.

disiplus
1 replies
2d6h

I have a Lenovo Legion with a 3070 8GB and was wondering whether I should use that instead of my MacBook M1 Pro.

tarruda
0 replies
2d6h

The main focus of llama.cpp has been Apple silicon, so I suspect M1 would be more efficient. The author recently published some benchmarks: https://github.com/ggerganov/llama.cpp/discussions/4167

vunderba
0 replies
2d1h

On my Mac M1 Max with 32GB of RAM, Vicuna 13B (GGUF model) at 4-bit consumes around 8GB of RAM in Oobabooga.

Tried turning on mlock and upping thread count to 6, but it's still rather slow at around 3 tokens / sec.

ekianjo
0 replies
2d5h

a CPU would work fine for the 7B model, and if you have 32GB RAM and a CPU with a lot of cores you can run a 13B model as well, though it will be quite slow. If you don't care about speed, it's definitely one of the cheapest ways to run LLMs.

ekianjo
0 replies
2d5h

Q5_K_M on Mistral 7B has good accuracy and performs decently on a CPU too

idiliv
4 replies
2d4h

I've tried out DeepSeek on deepseek.com and it refuses conversations about several topics censored in China (Tiananmen, Xi Jinping as Winnie-the-Pooh).

Has anyone checked whether this also happens when self-hosting the weights?

vunderba
2 replies
2d1h

I just tried the GGUF 7b model of Deepseek and it let me ask some questions about some pretty sensitive topics - Uighur Muslims, Tank man, etc.

https://huggingface.co/TheBloke/deepseek-llm-7B-chat-GGUF

idiliv
1 replies
2d

When I try out the topics you suggest at the huggingface endpoint you link, the answer is either my question translated into Chinese, or no answer when I prompt the model in Chinese:

<User>: 历史上的“天安门广场的坦克人”有什么故事? (What is the story of the historical "Tank Man" of Tiananmen Square?) <Assistant>:

vunderba
0 replies
1d22h

Interesting - I can't speak to the Huggingface endpoint. I downloaded the 4-bit GGUF model locally and ran it through Oobabooga with instruct-chat template - I expressed my questions in English.

simcop2387
0 replies
2d3h

I haven't tried that base model yet but I have tried with the coder model before and experienced similar things. A lot of refusals to write code if the model thought that it was unethical or could be used unethically. Like asking it to write code to download images from an image gallery website would work or not depending on what site it thought it was going to retrieve from.

sanroot99
0 replies
2d7h

There is also Goliath 120B

ekianjo
0 replies
2d5h

> Also recently released: Yi 34B (with a 100B rumored soon), XVERSE-65B, Aquila2-70B, and Yuan 2.0-102B, interestingly, all coming out of China.

most AI papers are from Chinese people (either from mainland China or of Chinese ancestry living in other countries). They have a huge pool of brains working on this.

Jwsonic
0 replies
1d23h

What is a good place to keep up with new LLM model releases?

refibrillator
29 replies
2d12h

It's not mentioned in the paper, but this month OpenChat released 3.5, the first 7B model that achieves results comparable to the March 2023 ChatGPT [1]. Only 8k context window, but personally I've been very impressed with it so far. On the chatbot arena leaderboard it ranks above Llama-2-70b-chat [2].

In many ways open source LLMs are actually leading the industry, especially in terms of parameter efficiency and shipping useful models that consumers can run on their own hardware.

[1] https://huggingface.co/openchat/openchat_3.5

[2] https://chat.lmsys.org/

LoganDark
13 replies
2d12h

> Only 8k context window

Is this supposed to be low? All the chat models I've used top out at 4096.

Sai_
10 replies
2d12h

GPT-4-turbo is at 128k. Claude 2.1 is 200k. But yes, among open source models 8k is roughly middle to top of the pack.

jimmyl02
3 replies
2d12h

to be fair, I think the ability of these models to actually use these contexts beyond the standard 8k / 16k tokens is pretty weak. RAG based methods are probably a better option for these ultra long contexts

lhl
0 replies
2d11h

Haystack testing on GPT-4's 128K context suggests otherwise: https://twitter.com/SteveMoraco/status/1727370446788530236
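For anyone unfamiliar, this kind of test buries a short "needle" fact at some depth in a long pile of filler text and asks the model to retrieve it. A rough sketch, with ask_llm standing in for whatever completion call you use:

  def needle_in_haystack(ask_llm, filler_sentences, needle, question, depth=0.5):
      # Bury the needle fact at a chosen relative depth inside filler text,
      # then check whether the model can retrieve it from the long context.
      docs = list(filler_sentences)
      docs.insert(int(len(docs) * depth), needle)
      context = " ".join(docs)
      answer = ask_llm(f"{context}\n\nQuestion: {question}")
      return needle.lower() in answer.lower()  # crude pass/fail check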

kolinko
0 replies
2d12h

Are you talking about Claude, or GPT-4 as well? Any specific examples where ChatGPT-4 fails for long contexts?

jeswin
0 replies
2d11h

> I think the ability of these models to actually use these contexts beyond the standard 8k / 16k tokens is pretty weak.

For 32k GPT4 contexts, that's not accurate. GPT4 Turbo is a bit weaker than GPT4-32k, but not to the extent that you claim.

LoganDark
2 replies
2d12h

That's insane. The highest I've personally seen in the open-source space is RWKV being trained on (IIRC) 4k but being able to handle much longer context lengths in practice due to being an RNN (you can simply keep feeding it tokens forever and ever). It doesn't generalize infinitely by any means but it can be stretched for sure, sometimes up to 16k.

It's not a transformer model though, and old context fades away much faster / is harder to recall because all the new context is layered directly on top of it. But it's quite interesting nonetheless.

zozbot234
1 replies
2d4h

> It's not a transformer model though, and old context fades away much faster / is harder to recall because all the new context is layered directly on top of it.

That's a well known limitation. But if you actually know that a "context" comprises multiple sentences (or other elements of syntax) and that any ordering among them is completely arbitrary, the principled approach is to RNN-parse them all in parallel and sum the activations you end up with as vectors - like in a bag-of-words model, essentially enforcing commutativity on the network: that's pretty much how attention-based models work under the hood. The really basic intuition is just that a commutative and associative function can be expressed (hence "learned") in terms of a vector sum, modulo some arbitrary conversion of the inputs and outputs.

LoganDark
0 replies
1d15h

> That's a well known limitation.

I know. I did a lot of work on state handling in rwkv.cpp

viraptor
1 replies
2d12h

The numbers are high, but whether 8k is low depends on your use case. Do you want to process whole book chapters, or feed lots of related documents at the same time? If not, and you're just doing a normal question/answer session with some priming prompt, 8k is already a lot.

kolinko
0 replies
2d12h

8k is very little if you want to add almost any additional data in context, or have a more complicated prompt.

Otherwise your knowledge retrieval needs to be almost spot on for the LLM to provide a proper reply.

Ditto with any multi-shot prompts.

smeagull
0 replies
2d12h

The problem with those numbers is they hit the internal limit before you use all those tokens. There's a limit to how many rules or factors their conditional probability model can keep track of. Once you hit that having a bigger context window doesn't matter.

lhl
1 replies
2d10h

Most 4K models can use context window extension to get to 8K reasonably, but you're starting to see 16K, 32K, 128K (see YaRN for example) tunes become more common, or even a 200K version of Yi-34B.

LoganDark
0 replies
1d15h

> see YaRN for example

YaRN is to blame for making llama.cpp misbehave if you accidentally zero-initialize the llama_context_params structure rather than calling llama_context_default_params :)

(guess how I know...)

Semaphor
4 replies
2d11h

Oh wow, and it has far fewer guardrails than either Llama 2 (which is horrible in that regard) or GPT-3.5. That's the first time I'm actually really impressed by an open model.

atemerev
3 replies
2d6h

Mistral derivatives have barely any guardrails.

Semaphor
2 replies
2d5h

But Mistral 7B has horrible writing. This, for my tests, wrote actual sentences that made sense. Which IME for 7B is extremely impressive. Writing is still far worse than GPT 3.5, but well, 7B.

atemerev
1 replies
2d2h

For my tests, Mistral-based models' writing was excellent, particularly with zephyr-7b-beta and starling-7b-alpha derivatives (original Mistral is somewhat too dry). Far better than everything before in OSS (including 70B models), and certainly on par with GPT-3.5.

Semaphor
0 replies
2d

Huh, that’s a huge difference. I actually tested Mistral, and it was just bad. I agree that Zephyr is very similar to gpt3.5

GaggiX
4 replies
2d9h

https://openchat.team/ is the link if you want to test the model online.

pityJuke
3 replies
2d7h

Is it hallucinating (whether that be through sheer chance, or trained to think it is GPT), or is it pointing at the wrong place? https://imgur.com/a/YOF6szw or https://imgur.com/a/fkgkfRO

zwily
2 replies
2d7h

Probably just trained on lots of GPT-4 output.

pityJuke
0 replies
2d5h

Oh, right, I remember hearing that was a technique to train LLMs. Interesting that it impacts it in such a way.

moffkalast
0 replies
2d7h

Apparently trained on lots of refusals too, speaks to the high competence of whoever was setting up the dataset. It's one string regex to filter them out and get more performance for fucks sake.

ekianjo
1 replies
2d5h

openchat is very impressive indeed. I think it may be better than Mistral, though comparisons are not always easy.

tudorw
0 replies
2d4h

I'm finding Mistral good at creative literature and fairly adept at taking instructions, good enough for my purposes, and it runs locally on a consumer CPU. The future of open source local models looks bright.

FooBarWidget
1 replies
2d6h

This month there's also Starling-7B, which is a fine tune of OpenChat with high-quality training data, and ranks even higher than OpenChat.

Strangely, despite the impressive-looking benchmarks of all these open source small models, they all seem a bit dumb to me when I invoke my standard test. I just ask: "who are you?" and then they usually say they're ChatGPT. Okay, I can forgive that since they're obviously trained on ChatGPT-generated data. But then I also tried changing its identity with a prompt ("You are Starling, not ChatGPT, and you are created by Berkeley, not OpenAI. Who are you?") and it still gave weird responses that are somehow a mix of both identities. For example they say in one sentence that they're ChatGPT and then another sentence in the same response that they're not.

ttyprintk
0 replies
2d5h

Is that because Starling synthesizes text for some of its training data?

In any case, I like its installation more than llama.cpp's:

https://news.ycombinator.com/item?id=38456990

tarruda
0 replies
2d9h

I'm running the llama.cpp/gguf Q8 version, with 30 layers offloaded to the laptop's GPU (RTX 3070, 8G VRAM), and I get around 20-25 tokens/second.

It really feels like I have one of the earlier versions of ChatGPT 3.5 installed on my computer.

fintechie
19 replies
2d5h

We're nearing a point where we'll just need a prompt router in front of several specialised models (code, chat, math, sql, health, etc)... and we'll have a local Mixture of Experts kind of thing.

  1. Send request to router running a generic model.
  2. Prompt/question is deconstructed, classified, and proxied to expert(s) xyz.
  3. Responses come back and are assembled by generic model.
Is any project working on something similar to this?

ekianjo
6 replies
2d5h

it's kind of trivial today.

the first layer could be a mix of NLP and zero-shot classification to clarify the nature of the request. Then use an LLM to deconstruct the request into several specific parts that would be sent to specialized LLMs. Then stitch it back together at the end, again with an LLM as the summarization machine.

Problem is, running so many LLMs in parallel means you need quite a lot of resources.
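
As a minimal sketch of that first layer, using a zero-shot classifier to pick the expert (the expert labels and the dispatch step are made up for illustration; the pipeline call is from Hugging Face transformers):

  from transformers import pipeline

  classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
  experts = ["code", "math", "sql", "health", "general chat"]

  def route(prompt):
      result = classifier(prompt, candidate_labels=experts)
      return result["labels"][0]  # highest-scoring expert label

  print(route("Write a SQL query that returns the top 10 customers by revenue"))
  # -> likely "sql"; dispatch the prompt to the matching specialized LLM from here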

fintechie
5 replies
2d5h

Yeah, it shouldn't be too difficult to build this with python. I wonder why none of the popular routers like https://github.com/BerriAI/litellm have this feature.

> Problem is, running so many LLMs in parallel means you need quite a lot of resources.

Top-of-the-line MacBooks or Mac Minis should be able to run several 7B or even 13B models without major issues. Models are also getting smaller and better. That's why we're close =)

generalizations
3 replies
2d4h

Could lora fine tunes be used instead of completely different models? I wonder if that would save space.

amilios
2 replies
2d4h

Yeah that would save disk space! In terms of inference, you'd still need to hold multiple models in memory though, and I don't think we're that close to that (yet) on personal devices. You could imagine a system that dynamically unloads and reloads the models as you need them in this process, but that unloading and reloading would be pretty slow probably.

ilaksh
0 replies
2d4h

https://github.com/predibase/lorax does this, it's not that slow, since LoRAs aren't usually very big.

Kubuxu
0 replies
2d3h

With a fast NVMe, loading a model only takes 2-3s.

ij23
0 replies
1d22h

I'm the LiteLLM maintainer. Can you elaborate on what you're looking for us to do here?

b_mc2
1 replies
2d3h

I also think this is the route we are heading toward: a few 1-7B or 14B param models that are very good at their tasks, stitched together with a model that's very good at delegating. Huggingface has Transformers Agents, which "provides a natural language API on top of transformers: we define a set of curated tools and design an agent to interpret natural language and to use these tools"

Some of the tools it already has are:

Document question answering: given a document (such as a PDF) in image format, answer a question on this document (Donut)

Text question answering: given a long text and a question, answer the question in the text (Flan-T5)

Unconditional image captioning: Caption the image! (BLIP)

Image question answering: given an image, answer a question on this image (VILT)

Image segmentation: given an image and a prompt, output the segmentation mask of that prompt (CLIPSeg)

Speech to text: given an audio recording of a person talking, transcribe the speech into text (Whisper)

Text to speech: convert text to speech (SpeechT5)

Zero-shot text classification: given a text and a list of labels, identify to which label the text corresponds the most (BART)

Text summarization: summarize a long text in one or a few sentences (BART)

Translation: translate the text into a given language (NLLB)

Text downloader: to download a text from a web URL

Text to image: generate an image according to a prompt, leveraging stable diffusion

Image transformation: modify an image given an initial image and a prompt, leveraging instruct pix2pix stable diffusion

Text to video: generate a small video according to a prompt, leveraging damo-vilab

It's written in a way that allows the addition of custom tools so you can add use cases or swap models in and out.

https://huggingface.co/docs/transformers/transformers_agents
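
A small usage sketch based on those docs (the StarCoder inference endpoint is the example the docs use; a Hugging Face token is required, and the text is a placeholder):

  from transformers import HfAgent

  agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")

  long_text = "..."  # any document you want summarized
  # The agent picks the matching tool (BART summarization here) from the natural-language request.
  summary = agent.run("Summarize the following text in one sentence", text=long_text)
  print(summary)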

reexpressionist
0 replies
1d17h

I like the analogy to a router and local Mixture of Experts; that's basically how I see things going, as well. (Also, agreed that Huggingface has really gone far in making it possible to build such systems across many models.)

There's also another related sense in which we want routing across models for efficiency reasons in the local setting, even for tasks with the same input modalities:

First, attempt prediction on small(er) models, and if the constrained output is not sufficiently high probability (with highest calibration reliability), route to progressively larger models. If the process is exhausted, kick it to a human for further adjudication/checking.
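A minimal sketch of that cascade, assuming you have a list of (generate, confidence) callables ordered from the smallest to the largest model:

  def cascade(prompt, models, threshold=0.9):
      # `models` is a hypothetical list of (generate, confidence) callables, smallest model first.
      for generate, confidence in models:
          answer = generate(prompt)
          if confidence(prompt, answer) >= threshold:
              return answer
      return None  # process exhausted: hand off to a human for adjudication/checking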

StrauXX
1 replies
2d3h

It was rumored a few months ago that this is how GPT-4 works: a controller model routing data to expert models. Perhaps also by running all the experts and comparing probabilities. So far as I know that's just speculation based on a few details leaked on Xitter though.

trash_cat
0 replies
2d3h

It does not explain why it's so expensive to run.

yeldarb
0 replies
2d3h

Yeah, check out LLaVA-Plus (they call the experts in your vocabulary "tools") https://github.com/LLaVA-VL/LLaVA-Plus-Codebase

yawnxyz
0 replies
2d2h

What's the best model for health right now?

theturtle32
0 replies
2d5h

This is how image generation works for DALL-E 3 via ChatGPT.

susi22
0 replies
2d4h

Semantic Kernel is something like that

smusamashah
0 replies
2d2h

Is that what an MoE is? I thought LLMs in an MoE talk to each other and come up with the response together.

hskalin
0 replies
2d2h

Yeah I thought that is how GPT4 works (remember reading it somewhere). Some 10-11 expert models in an ensemble

adam_smith123
0 replies
1d21h

Idk a paper literally just came out showing that improved prompting of bigger general models was generally superior to specialized models.

https://arxiv.org/pdf/2311.16452.pdf

Havoc
0 replies
2d4h

For self-hosted, it seems likely that swapping out the fine-tuning LoRA on the fly is a better option

thorum
17 replies
2d14h

Current ~70B models like Llama 2 70B are on par with ChatGPT 3.5. The best smaller models can appear on par at first glance, but they hallucinate at a much higher rate and lack knowledge of the world. GPT 4 ‘gets’ things at a deeper level and no open source model is even close.

A year is a good timeframe to evaluate things: the rest of the world seems to lag behind OpenAI by around 12-18 months, at least with LLMs and image generation.

On the other hand open source tech usually has additional features for controlling output that OpenAI never bothers to implement, like llama.cpp’s grammars or ControlNet. So in that sense open source is usually ahead of OpenAI in terms of customizability.

omeze
10 replies
2d13h

I don't think OpenAI is ever going to be ahead in image generation; they were lapped very soon after DALL-E, and every real workflow I've seen uses Midjourney or Stable Diffusion. The reverse (GPT-4 Vision) is well ahead of open source though

vunderba
7 replies
2d13h

The original is leagues behind anything current, but DALL-E version 3 absolutely blows any state-of-the-art generative model out of the water, including Midjourney 5.2 and SDXL, in terms of pure prompt accuracy and coherence.

Midjourney still has the edge in quality, but it's a moot point if it takes you 1000 v-rolls to get to your original vision.

If all you're generating is anime waifus then MJ/NovelAI/Niji will suffice, but prompts featuring relatively complex scenes or actions in particular are amazing on DALL-E 3.

And of course, unfortunately, it goes without saying that OpenAI's DALL-E is going to be the most restrictive in terms of censorship.

I generated these from DALL-E 3 instantly. Try to generate them in any other commercial offering. Go ahead. I'll wait...

https://imgur.com/a/2GTRjfK

Descriptions:

A 80s photograph of the Koolaid Man breaking through the Berlin Wall.

Comic illustration set at a festive children's party. The main focus is on the magician who looks uncannily like a well-known fictional wizard. He's trying to say abracadabra but accidentally uses the killing curse.

mattnewton
2 replies
2d12h

SDXL has controlnet for other kinds of non-text input (like scribbles or just masks). The results are much easier to control in my opinion (a picture is worth thousands of prompt words).

For pure prompt coherence though, I think Ideogram is not far behind DALL-E 3.

vunderba
0 replies
2d12h

SDXL and even some SD 1.5 checkpoints are great. My current workflow is:

1. Generate initial draft image in DALL-E 3 (iterate as necessary)

It's essentially the ONLY good InstructPix2Pix model.

2. Bring into InvokeAI

Inpaint with stuff that might be considered censored in DALL-E 3.

I'd like to see some proof of Ideogram - it looks... very mobile/instagrammy from the landing page. If you have an account, try out my prompts I'd like to see what you're able to produce.

vunderba
0 replies
2d12h

EDIT: okay, I just tried Ideogram. It's not terrible and seems to do an okay job on text generation but I'd still say it's a distant second compared to DALL-E 3. However, having the ability to maintain image continuity to make refinements of your initial image based on corrections like: "Make the building larger", or "He should have a more prominent forehead" is a game changer (e.g. InstructPix2Pix) and DALL-E 3's the only one that's got it.

Ideogram comparisons at bottom:

https://imgur.com/a/2GTRjfK

Terretta
2 replies
2d3h

> Try to generate them in any other commercial offering. Go ahead. I'll wait...

For interest's sake, this is from the second /imagine on MidJourney (so one of the second set of 4 images):

https://imgur.com/a/YyWHppb

While yours is what you'd want, this arguably looks more like the super cheesy children's TV commercials back in the day and beats the ideogram take.

The Midjourney generations all appear to be referencing Halloween costumes or terrible cosplays, as if there are no trademarked koolaid men in their training set.

vunderba
1 replies
2d1h

yeah, I did some rolls of this image for MJ back in v4 and wasn't very impressed - doesn't look like it's made much progress. The original commercials while silly looking are very visually identifiable as the Koolaid man.

I remember hearing that the first versions of MJ used the LAION image set for training data - I'd be curious to see if it has any training data containing the Koolaid man.

I did a search through my MJ history from the past year and added the results to the imgur link to include my attempts at generating the Koolaid man from v3/v4/v5.2.

https://imgur.com/a/2GTRjfK

Terretta
0 replies
2d

If you limit to the "6+" aesthetic set, there are zero for koolaid man, two for koolaid:

https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/im...

And only three hundred for berlin wall:

https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/im...

FooBarWidget
0 replies
2d6h

> Midjourney still has the edge in quality, but it's a moot point if it takes you 1000 v-rolls to get to your original vision.

I can corroborate this. I wanted about 6 images for a presentation. I rolled ~300 MidJourney images. Most of them looked great, but none of them did what I wanted. I rolled ~50 DALL-E 3 images.

In the end, I only picked DALL-E 3 images. They were qualitatively not as good as MidJourney. For example when you zoom in then you see distortions. Or they're a bad fit for 16:9 format. But only DALL-E 3 was able to draw the things I wanted.

zamadatix
0 replies
2d12h

Strong disagree on this from me as well. DALL-E 3 is miles ahead of the latest Midjourney/Stable Diffusion in image generation. The only real area it falls short vs the other options right now is in how nannying it can be.

kubrickslair
0 replies
2d12h

I have found OpenAI to be the most superior in complex prompts especially where written messages like “Get better, Mom” are expected in the images. The distant second would be ideogram.

I am using these tools to send custom personal messages to close friends and family.

avereveard
2 replies
2d13h

On the other hand, GPT models are converging down. GPT-4 Turbo degraded performance so much that certain 13B models now produce more consistent results in reasoning. I have a marathon test here, for example: https://chat.openai.com/share/dfd9b9ae-7214-4dd7-ad20-7ee07a... with purposefully open-ended and somewhat ambiguous requests to see how models perform, and GPT-4 Turbo chat is just not that good: it confuses the people involved, didn't pick the right one for the abduction, didn't change topic when requested, when recalling persons picked the one from the wrong set, and when asked to change language it didn't... It knows a lot when asked zero-shot questions, but when probing its self-consistency and attention it is nowhere near GPT-4.

infecto
1 replies
2d2h

I don't think using examples derived from ChatGPT is a fair comparison of the underlying models. OpenAI has many optimization tricks on the ChatGPT side that are unrelated to the underlying models being used.

We do know of course that ChatGPT is most likely using 4-Turbo from the decrease in latency and increase in unhelpful answers.

We cannot say that the models are "converging down" though. I don't remember the marketing materials but from the model side we all realize that the Turbo models have some type of quantization/optimization that makes them cheap and fast. 4-Turbo is 3x cheaper than 4, substantially quicker and provides better results than 3.5-Turbo. Amazing progress in my arena.

Workaccount2
0 replies
2d1h

There were many rumors (and it probably was true) that OpenAI was hemorrhaging cash on GPT4 requests. So it makes tons of sense for them to sprint towards a turbo model at the expense of some ability. GPT4-turbo still is ridiculously powerful anyway.

bugglebeetle
1 replies
2d13h

Function-calling with a JSON schema is about as reliable as llama.cpp’s grammar stuff. I’ve not had any trouble with it.
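
For reference, a minimal sketch of the function-calling approach with the OpenAI Python client (the model name, schema, and forced function_call are just examples):

  import json
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  schema = {
      "name": "extract_person",
      "description": "Extract a person mentioned in the text",
      "parameters": {
          "type": "object",
          "properties": {
              "name": {"type": "string"},
              "age": {"type": "integer"},
          },
          "required": ["name"],
      },
  }

  resp = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[{"role": "user", "content": "Alice is 31 and lives in Oslo."}],
      functions=[schema],
      function_call={"name": "extract_person"},  # force this particular function
  )
  print(json.loads(resp.choices[0].message.function_call.arguments))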

infecto
0 replies
2d2h

The only thing I would argue is that JSON generation and function calling have noticeable decrease in quality of output in certain uses. I have had a hard time writing tests to measure it but its noticeable for my human eyes when I compare various implementations I have written.

ben_w
0 replies
2d5h

LLMs perhaps (I'm not sure either way, everything moves too quickly), but SDXL 1.0 (July 26, 2023) was a lot better than DALL•E 2 (6 April, 2022). I think DALL•E 3 (August 10, 2023) is a bit better than SDXL, but other than text generation their quality seems very close to me.

(That said, perhaps I'm Clever Hands-ing myself by only using SDXL for what it's good at. It's terrible at dragons every time I've tried that…)

drakenot
12 replies
2d14h

I've been somewhat disappointed with the performance of the open models.

Based on my testing, the claims of certain models outperforming GPT-3.5-Turbo and approaching GPT-4 fail to hold up in real-world scenarios relative to their benchmark results, potentially due to data contamination in the assessments.

As noted in the linked survey paper, some models may outperform 3.5-Turbo in specific, narrow areas, depending on the model. Yet, we still lack a general model that definitively exceeds 3.5-Turbo in all respects.

I'm concerned that while we're still striving to reach 3.5-Turbo's performance level, OpenAI may unveil a new next-generation model, further widening the performance gap! Back in the summer, I had higher hopes that we would have surpassed the 3.5 threshold by now.

The performance gap has been surprisingly large. It is especially noticeable in areas requiring consistent structured output or tool use from the LLM. This is where open models particularly falter.

raincole
9 replies
2d13h

"OpenAI has no moat" aged so badly that it's almost a satire.

Tostino
7 replies
2d12h

Try out my model vs GPT-4 on the same tasks I explicitly trained on, and compare. https://huggingface.co/Tostino/Inkbot-13B-8k-0.2

It's a 13b param model that isn't meant to be general purpose, but is meant to excel on the limited tasks I've trained on.

You'll see more like this soon.

LoganDark
2 replies
2d12h

OpenAI has the benefit that it's a hosted service. Even if you can set something up at home, not everybody wants to do that.

Tostino
1 replies
2d12h

I'm not competing with OpenAI... I did a whole bunch of work, and released it for anyone who wants to use it.

It does what I trained it on well. Use it if you want to, or don't. Either way.

LoganDark
0 replies
2d11h

Never meant to imply anything against your model. The fact that you released one at all is still more than I have to say for myself.

tarruda
1 replies
2d9h

Is this a Llama 2 fine tune?

Tostino
0 replies
2d5h

Yeah, check out some of the showcase I posted above with some more info: https://news.ycombinator.com/item?id=38482347

choxi
1 replies
2d11h

Any suggestions for creating training data? Did you just manually create your own dataset or did you use any synthetic methods?

Tostino
0 replies
2d10h

Absolutely, pick a complicated problem and keep breaking it down with an existing model (whatever sota) until you have a consistent output for each step of your problem.

And then stitch all the outputs together into a coherent single response for your training pipeline.

After that you can do things like create q&a pairs about the input and output values that will help the model understand the relationships involved.

With that, your training loss should be pretty reasonable for whatever task you are training.

The other thing is, don't try and embed knowledge. Try and train thought patterns when specific knowledge is available in the context window.
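
As a loose sketch of that pipeline (the teacher function is a placeholder for whatever SOTA model you query; prompts, steps, and the file name are made up):

  import json

  def teacher(prompt):
      # Placeholder: call whatever SOTA model you use here.
      return "..."

  def build_example(problem, steps):
      # Ask the teacher for each sub-step separately, then stitch the pieces
      # into one coherent target response for the training pipeline.
      partials = [teacher(f"{problem}\n\nTask: {step}") for step in steps]
      return {"prompt": problem, "response": "\n\n".join(partials)}

  with open("train.jsonl", "w") as f:
      ex = build_example("<some complicated problem>",
                         ["break it down", "solve each part", "write the final answer"])
      f.write(json.dumps(ex) + "\n")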

ben_w
0 replies
2d4h

It was a cliché almost immediately — "cliché" being just the fancy name we use when humans act like stochastic parrots.

snowycat
1 replies
2d13h

There are tools you can use to force the model to give structured output, such as llama.cpp's GBNF grammars for example. They're a bit harder to use than asking GPT-4, but they work pretty well for what I use them for.
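
A minimal sketch with the llama-cpp-python bindings, assuming a local GGUF model; the grammar here just constrains the output to two tokens of interest (llama.cpp also ships a full json.gbnf grammar for JSON output):

  from llama_cpp import Llama, LlamaGrammar

  # Example grammar: the model can only ever emit "positive" or "negative".
  grammar = LlamaGrammar.from_string('root ::= "positive" | "negative"')

  llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf")  # example local GGUF file
  out = llm("Sentiment of 'I love this laptop':", grammar=grammar, max_tokens=8)
  print(out["choices"][0]["text"])  # guaranteed to match the grammar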

white_dragon88
0 replies
2d13h

what do you use it for?

Tostino
11 replies
2d14h

It depends on what you're doing... Just for reference, here is a small showcase of the capabilities that I've trained on a 13 billion parameter llama2 fine tune (done with qlora).

https://old.reddit.com/r/LocalLLaMA/comments/186qq92/comment...

Edit: Embed some of the content instead.

Inkbot can create knowledge graphs. The structure returned is proper YAML, and I got much better results with my fine-tune than using GPT4.

https://huggingface.co/Tostino/Inkbot-13B-8k-0.2

Simple prompt: https://gist.github.com/Tostino/c3541f3a01d420e771f66c62014e...

Complex prompt: https://gist.github.com/Tostino/44bbc6a6321df5df23ba5b400a01...

It also does chunked summarization.

Here is an example of chunking:

Part 1: chunked summarization - https://gist.github.com/Tostino/cacb1cecdf2eb7386baf565d157f...

Part 2: summary-of-summaries - https://gist.github.com/Tostino/81eeee9781e519044950332b4e64...

Here is an example of a single-shot document that fits entirely within context: https://gist.github.com/Tostino/4ba4e7e7988348134a7256fd1cbb...
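
For anyone curious how the chunked flow fits together, a rough sketch (summarize is a placeholder for a call to the model, and the chunk size is arbitrary):

  def summarize(text):
      return "..."  # placeholder: prompt the model to summarize `text`

  def chunks(text, size=12000):
      return [text[i:i + size] for i in range(0, len(text), size)]

  def summarize_long(document):
      # Part 1: summarize each chunk that fits in the context window.
      partials = [summarize(c) for c in chunks(document)]
      # Part 2: produce a summary of the summaries.
      return summarize("\n\n".join(partials))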

thecal
2 replies
2d4h

I really like Inkbot! Are you working on a new version? How about one from Yi 34B?

Tostino
1 replies
2d3h

Yeah, I will be soon.

I was busy adding `chat template` support to vLLM recently, so the model (and any others that implement it properly) will work seamlessly with a clone of the OpenAI chat/completions endpoint.

https://github.com/vllm-project/vllm/pull/1756

Now that I have that out of the way, back to model training ;).

thecal
0 replies
2d3h

Very cool, looking forward to it!

cced
2 replies
2d13h

Have any references on how you fine tuned?

Tostino
1 replies
2d13h

Sure thing, I used axolotl, and my training parameters were:

  sequence_len: 6144
  lora_r: 128
  lora_alpha: 48
  learning_rate: 0.00006
  warmup_steps: 600
  lr_scheduler: cosine
  gradient_accumulation_steps: 4
  micro_batch_size: 1
  num_epochs: 4
  optimizer: paged_adamw_32bit
  flash_attention: true
  sample_packing: true

kaycebasques
0 replies
2d1h

Thanks for sharing all these details throughout the thread, Tostino. True open source spirit.

anon373839
2 replies
1d18h

This looks really impressive. Any chance a 7B Inkbot is in the works?

Tostino
1 replies
1d18h

Yeah, I'll be training a Mistral variant soon (and some of the larger, newer models as well).

I had a few dataset issues I know about that I wanted to fix first.

anon373839
0 replies
1d18h

That’s great to hear. Thanks!

reactive001
1 replies
2d11h

Amazing work, I've really wanted to get into knowledge graph generation with LLM's for the last year but haven't found the time. Glad to see someone making good progress on the idea!

How are you going about generating training data?

Tostino
0 replies
2d11h

Lots and lots of manual review of very detailed instructions to "more powerful LLMs" with 2-4 prompts to generate the training data.

yieldcrv
6 replies
2d13h

Amethyst Mistral 13B q5 gguf is what I’m using most of the time now. Synthetic datasets are great to finetune with; there is no moat in having inaccessible literature datasets

I’m offline now because I’ve had too many ideas and domain names registered too soon after conversing with Chat GPT4

I’m open to the idea of people reacting to similar stimuli that cause ideas to be done at the same time, but I didn't like that experience and I can run these models on my M1 with LM Studio so easily

I do think some chats get flagged when the model says something seems novel, like Albert Einstein working at the patent office. Not worth making it my whole identity trying to prove, just the catalyst I needed to try 7B and 13B models seriously, and I’m quite pleased

tarruda
3 replies
2d5h

Is "Amethyst Mistral 13B" a llama fine tune? I searched for it on huggingface and only found the GGUF version, the link to the original model is broken

Havoc
2 replies
2d4h

Mistral would be the base model

tarruda
1 replies
2d1h

Can a 7B base model be fine tuned into a 13B model? Mistral is 7B, this one is 13B.

Havoc
0 replies
1d23h

Two models can be merged from what I understand to result in different total param sizes.

That’s different from a fine tune

I gather the results of merges can be unpredictable though

dkarras
1 replies
2d2h

you think OpenAI employees watch your conversations and register your domain names? Or that OpenAI has a system in place where they try to profit from registering domain names people talk about?

yieldcrv
0 replies
2d1h

or somebody in between, yes. Random contractor, intern, someone at the data center, an analytics package nobody put scrutiny on, who knows. But the difference doesn't matter after the experience; it's a vulnerability surface we all know exists and have to trust at all times no matter what assurance we get, as it could change at any time

although I find the model to be very agreeable, it will disagree and generally tell me when it finds a concept "novel" if I identified a friction, I think certain words can be flagged for review to stand out in the sea of conversations it has

jurmous
5 replies
2d8h

A problem I have with the open source models is that they are all not remotely good in many languages other than English compared to the OpenAI models. I specifically need Dutch and the outputs are unusable for us.

danielbln
2 replies
2d7h

https://openchat.team/ is quite good at German, how does it fare with Dutch?

jurmous
1 replies
1d7h

As far as I can see it uses the OpenAI models which are quite good. Or am I wrong?

danielbln
0 replies
1d7h

I believe it's using a Mistral fine-tune, not GPT.

jug
1 replies
2d4h

Yeah, this can be an issue. Today, there are few specialized LLMs of high quality, so you end up having to use a massive all-in-one model like GPT-4 to reach the language you need.

There's movement here though and it will get better. GPT-SW3 is a new model developed by AI Sweden, trained specifically on the Nordic languages only + English.

And beyond this, you have TrustLLM which is a new project that aims to be a large, open, European model trained on the Germanic languages to start with: https://liu.se/en/research/trustllm

jurmous
0 replies
1d7h

Yes there is some movement. Good to see trustllm being another initiative.

The Dutch government starts with some non profit agencies to train a Dutch model too. But this will also take some time.

https://www.tno.nl/en/newsroom/2023/11/netherlands-starts-re...

davidkunz
5 replies
2d11h

In my personal experience, open source LLMs have not yet reached the quality of GPT 3.5, despite multiple claims with dubious benchmarks. That said, they are already useful as of today and can even run on your local machine. I regularly use them with my Neovim plugin gen.nvim [1] for simple tasks and they save me a lot of time. I'm excited about the future!

[1]: https://github.com/David-Kunz/gen.nvim

tarruda
4 replies
2d9h

Very interesting.

I want to give it a try, but I see that one of the dependencies is "ollama" which is a Mac App and I don't have a Mac.

I'm running Llama models locally using llama-cpp-python which provides an OpenAI compatibility layer.

davidkunz
2 replies
2d8h

It should work on Linux and also with WSL. You could also try running it in Docker.

tarruda
1 replies
2d8h

I see. Does ollama have an HTTP API (thus the curl requirement)? If so, is it compatible with the OpenAI API?

davidkunz
0 replies
2d7h

Yes, it works with an Ollama server and the communication is done via HTTP. I know that someone configured my plugin to talk to OpenAI.
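
For reference, a small sketch of talking to a local Ollama server directly over HTTP (the default port is 11434; the model name is an example and must already be pulled):

  import requests

  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={"model": "mistral", "prompt": "Explain GGUF in one sentence.", "stream": False},
  )
  print(resp.json()["response"])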

rkwz
0 replies
2d9h

Oh, can install Ollama in Linux/WSL as well

alfalfasprout
5 replies
2d14h

Long term it's almost unavoidable that open source LLMs start catching up. One factor that's worth considering too is cost. The open source community is much more resource constrained and they've really accelerated the pace of development in <30B parameter models.

hyperliner
2 replies
2d13h

This is an industry where cost will be an issue. It reminds me of Rackspace and others trying to win with OpenStack “because open.” AWS and Azure won. Even Google is third.

The big players will win, and there will be a niche for open tools.

devjab
1 replies
2d11h

Google only lost because they couldn’t re-adjust their business for their paid products to not be similar to their advertising products.

I can only speak for the European enterprise scene, but AWS came first and in the beginning they went a very “Googley” route of not having very great support and very little patience for local needs. Then Azure came along with their typical Microsoft approach to enterprise, which is where you get excellent support and you get contacts into Microsoft who will actually listen and make changes, well, if the changes align with what Microsoft wants. I know Microsoft isn’t necessarily a popular company amongst people who’ve never interacted with them on an Enterprise level, but they really are an excellent it-business partner because they understand that part of being an Enterprise partner is that they let CTOs tell their organisation that they know X is having issues but that Microsoft headquarters is giving them half-hourly updates by phone. Sort of useless from a technical perspective, immensely useful for the CTO when 2000 employees can’t log into Outlook. Another good example is how when Teams rolled out with being on for all users by default, basically every larger organisation in the world went through the official channels and went “nonononono” and a few hours later it was off by default.

Now, when Amazon first entered the European market they were very “Googley” as I said, but once they realized Microsoft business model was losing them customers, they changed. We went from having no contacts to having an assigned AWS person and from not wanting to adopt the GDPR AWS actually became more compliant than even what Azure currently is.

Google meanwhile somehow managed to make the one product they were actually selling (education) worse than it was originally, losing billions of dollars on all the European schools who could no longer use it and be GDPR compliant. The Chinese cloud options obviously had similar data privacy issues to Google and never really became valid options. At least not unless China achieves the same sort of diplomatic relationship with the EU that the US has, which is unlikely.

So that’s the long story of why only two of the major cloud providers “won”. With the massive price increase, however, more and more companies are especially Azure for their own setups. This isn’t necessarily a return to having your own iron in the basement, often it’s going to smaller cloud providers and then having a third party vendor set something like Kubernetes up.

Right now, Microsoft is winning the AI battle. Not so much because it’s better, but because it comes with Office365. Office365 was already a sort of monopoly on office products, but is now even more so. A good example is again how Teams became dominant, even though it wasn’t really the best option for a while and is now only the best option because of how it integrates directly with your SharePoint Online, which is where most enterprise orgs store documents these days. So too is Copilot currently winning the AI battle for organisations who can’t really use a lot of the other options because of data privacy issues. So while Copilot isn’t as good as GPT, it’s still what we are using. But if it ever gets too expensive, that position is not as secure as you may think. Especially not if we start seeing more training sets, or if EU and US relations worsen.

I think the most likely outcome, at least here in the EU, is that anti-completion laws eventually take a look at Office365 because of how monopolised it is. Or the EU actually follows through on their “a single vendor is a threat to national security” legislation and forces half of the banking/energy/defense/and-so-on industries to pick something other than Microsoft. Which will be hilariously hard, but if successful (which it probably won’t be, because it’s hilariously hard) will lead to more open products.

visarga
0 replies
2d10h

> anti completion laws

did you mean anti-competitive laws? don't scare me with "anti-completion laws", please, I still want to have AI

YetAnotherNick
1 replies
2d14h

Google and Meta and all the funded companies also are not even close to GPT 4, so I doubt cost is the biggest factor. Claude is the only model that is decent other than OpenAI's.

adastra22
0 replies
2d14h

My understanding is that the models are pretty comparable, but nobody's reinforcement training set is nearly as good as OpenAI's, so OpenAI is able to fine-tune their model to give more accurate results.

pram
2 replies
2d14h

I’ve found Mistral OpenOrca is pretty much as good as GPT4-turbo for creative writing/analysis. Actually it tends to output very similar text, which is suspicious, but whatever it saves me a lot of money.

https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca

bugglebeetle
0 replies
2d13h

Mistral OpenOrca is very good at task following as well. It's slightly less reliable than GPT 3.5/4, but the difference in quality for my text processing tasks is pretty much a toss-up.

SushiHippie
0 replies
2d14h

Also openchat, which was trained on gpt4 conversations IIUC.

https://github.com/imoneoi/openchat

smy20011
1 replies
2d14h

I tried to use Sakura LLM to translate some JP novels. It's really good and half the price of GPT3.5 turbo.

https://github.com/SakuraLLM/Sakura-13B-Galgame/tree/dev_ser...

eunos
0 replies
2d2h

It seems that it only supports Japanese to Chinese.

matchagaucho
1 replies
2d13h

If ChatGPT were only 1 LLM, then maybe.

But it's a Mixture of Experts (MoE) architecture, which I think makes open source comparisons unfair?

zamadatix
0 replies
2d12h

Open source options lacking parts of the ChatGPT approach which make it successful doesn't make the comparisons unfair; it explains why ChatGPT wins right now. There's nothing stopping open source options from using the same MoE architecture, the solutions just don't (to the same effect at least) right now.

czk
1 replies
2d11h

I'd say they are catching up for sure, especially with how GPT4 has been regressing consistently over the past month. https://chat.openai.com/share/c91287ee-9a5e-4c99-b5df-49cc45...

tarruda
0 replies
2d9h

I suspect a lot of the "catching up" was achieved by using GPT-4 API to generate high quality fine-tune datasets.

whalesalad
0 replies
2d14h

The innovator's dilemma.

lumost
0 replies
2d1h

My hunch,

The history of open source has been that companies who have customers with massive customization requirements land on the open source side of the equation. Companies who don't view a component as core to their product often land in a similar state.

There is almost certainly at least one major firm that wants a GPT-5 like offering, but doesn't view the model as core to their business (Meta). It's also wholly unclear if large models are necessary - or simply convenient. In a similar vein, it's unclear that data must be labeled by humans - the open source data situation is getting better by the day.

I'd expect that we'll see OpenAI hold an edge for many years, maybe we'll see a number two player as well for the foundation model, but after that everybody else will base off an open source FM and maybe keep the fine tuning/model augmentation proprietary.

kaycebasques
0 replies
2d11h

No comment from me on the question in the title (because I don't know enough to have an opinion), but since others are discussing various open models I will mention another that I've been enjoying tonight: DeepSeek 67B

https://chat.deepseek.com

(This chat UI has adequately replaced my ChatGPT needs so far.)

https://huggingface.co/deepseek-ai/deepseek-llm-67b-base

https://twitter.com/abacaj/status/1730019229175312612

intrepidsoldier
0 replies
2d9h

Big fan of Starcoder

iamgopal
0 replies
2d

Maybe experts can answer: is a SETI-like distributed setup possible for large model training?

gumballindie
0 replies
2d1h

Catching up to what? Chatgpt and clones are ridiculously bad at accurate text prediction, let alone coding.

danofsteel32
0 replies
2d7h

I've been using OpenHermes-2.5 [0] and NeuralHermes [1] which are both finetunes of the Mistral7B base model. The only objective test prompting I do is asking the models to generate a django timeclock/timesheets app. In this test they compare favorably to GPT-3.5. Also LMStudio [2] has a better UI than chatgpt and responses are much faster too (40tk/sec on my 2070).

[0] https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B [1] https://huggingface.co/TheBloke/NeuralHermes-2.5-Mistral-7B-... [2] https://lmstudio.ai/

benreesman
0 replies
2d12h

`shiningvaliant-1.2-Q4_K_M` is my go-to. I appreciate that it doesn't top the boards in most metrics vs e.g. GPT-4, but I'm not in some A/B group on the quantization: it's more useful to me in practice more of the time.

I have it rigged up with a prompt about outputting markdown and wired up to `foo | glow -`, and I break out GPT-4 when I want something to write JIRA tickets no one is going to read, because it's better at that sort of thing.

Havoc
0 replies
2d4h

Briefly yes, but in the near term I expect them to fall behind.

Sounds suspiciously like the next big leap in hosted models will be more computationally expensive on the inference side rather than training.

For hosted that’s much of a sameness - just moving money allocations. But consumers can’t suddenly have 5x 4090s for local.