This is the only LLM that is exciting to me. Clearly, LLMs are powerful tools that may end up replacing search, and they may go much further than simple search by performing the research for you and producing final answers. Closed models like those from OpenAI (ironically) or Anthropic cannot be audited. When most users end up blindly hitting Microsoft’s Copilot button, which Microsoft is forcing OEMs to adopt, who’s to say how the information a user gets is being curated or manipulated by OpenAI or Microsoft or whoever?
We’ve already seen real-world examples of severe bias being injected into LLMs. For example, Google’s Gemini had secret meta prompts that biased it toward certain types of answers and also caused it to produce hallucinated images that were funny but also dystopian (https://arstechnica.com/information-technology/2024/02/googl...). I don’t think we can just let closed AI systems take over society when they can so easily be manipulated by the model owners without transparency.
What I like about AI2’s approach with OLMo is that they are actually open, not just trading on the marketing benefits of the word “open”. Most “open” models are only open weights, not open source. That’s like sharing an executable but not the source code. In my view, being open means that others must be able to reproduce the final product (the model) if they wanted to and had the means (in terms of training hardware). It also means that they should be able to use whatever is provided freely, for any purpose, rather than being subject to proprietary licensing. AI2 shares the training source code, the training data, the evaluation suite, and the model weights produced by running the training process, all under the Apache license. It’s also interesting that they used AMD hardware to train this LLM rather than Nvidia/CUDA.
Open weight models like Llama keep catching up to the best closed models from OpenAI, Anthropic, and others. My hope is that truly open models like OLMo keep developing quickly enough to keep up as well. Lastly, I hope that regulation does not block open source and private development of AI systems. These systems will be the vehicle for speech for much of society in the future, so blocking private AI systems is a lot like restricting speech. But leaving that aside, open development will also drive innovation, and reducing competitive pressure will hurt innovation.
Pet peeve: Google's Gemini LLM was not to blame for the image generation weirdness.
That would be like blaming DALL-E weirdness on GPT-4.
Unfortunately, Google marketing decided to slap the "Gemini" brand on both the end-user interface used to interact with the model AND the actual model itself, hence people constantly calling out Gemini-the-model for weird decisions made as part of Gemini-the-user-interface.
The way I read the Gemini technical report, it seemed like, unlike the GPT-4/DALL-E split, Gemini was pretrained with multimodal outputs. Is that not the case?
Is that right? I didn't think Gemini was generating images directly, I assumed it was using a separate image generation tool.
The paper here https://arxiv.org/pdf/2403.05530.pdf has a model card for Gemini 1.5 Pro that says:
Huh, that is true in the model cards of both Gemini 1.5 Pro and Gemini 1.0.
That feels like it runs counter to this statement from the Gemini 1.0 technical report[0]:
[0]: https://arxiv.org/pdf/2312.11805.pdf
Yeah, what does that bit about "image outputs" mean, I wonder?
Did anybody manage to get the entire prompt out of Gemini, or what are you basing your claim on?
That's my point. The system prompt isn't part of the model - it's part of the UI system that wraps the model.
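To make that distinction concrete, here's a minimal sketch, assuming a generic chat-completions-style request format (the payload shape and model name are illustrative, not any particular vendor's API). The system prompt is just text the wrapping product attaches to every request; the weights behind the API never change.

    # Illustrative only: the system prompt travels with each request;
    # the model weights behind the API stay exactly the same.
    def build_request(question, system_prompt=None):
        messages = []
        if system_prompt:
            # Injected by the wrapping UI/product, not baked into the model.
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": question})
        return {"model": "some-chat-model", "messages": messages}

    # Same model (same weights), different behavior, purely because the
    # wrapper added its own instructions:
    print(build_request("Generate an image of a medieval knight"))
    print(build_request("Generate an image of a medieval knight",
                        system_prompt="Always diversify depictions of people."))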
> That would be like blaming DALL-E weirdness on GPT-4.
Actually, when you trigger DALL-E through GPT-4 (i.e. with the LLM generating the prompt to give the diffusion model, then returning the resulting image to the user), the LLM's system instructions [1] say "7. Diversify depictions of ALL images with people to always include always DESCENT and GENDER for EACH person using direct terms." and a bunch of other stuff along those lines.
In OpenAI's system this doesn't always trigger; if the user asks for an image of trash being collected, they haven't explicitly asked for any people to be depicted, so the LLM doesn't find anything in the prompt that needs diversity added. The trash-being-collected prompt gets passed to DALL-E unmodified, and the resulting image has all-male workers.
[1] https://raw.githubusercontent.com/spdustin/ChatGPT-AutoExper...
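To illustrate the shape of that pipeline (the people check and the rewrite rule below are made up for the sketch, not OpenAI's actual logic), here's roughly what the LLM-in-the-middle step looks like:

    # Sketch of the flow described above: the LLM sits between the user and
    # the image model and may rewrite the prompt per its system instructions.
    PEOPLE_WORDS = {"person", "people", "man", "woman", "worker", "workers"}

    def llm_rewrite_prompt(user_prompt):
        # Stand-in for the GPT-4 step: it only touches prompts that
        # explicitly mention people.
        if PEOPLE_WORDS & set(user_prompt.lower().split()):
            # Crude stand-in for the "diversify depictions" instruction.
            return user_prompt + ", with people of varied descent and gender"
        return user_prompt

    def prompt_sent_to_image_model(user_prompt):
        # Stand-in for handing the (possibly rewritten) prompt to DALL-E.
        return llm_rewrite_prompt(user_prompt)

    print(prompt_sent_to_image_model("trash being collected"))  # passes through unmodified
    print(prompt_sent_to_image_model("a group of workers"))     # gets the diversity rewrite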
Yeah, I wrote about that last year: https://simonwillison.net/2023/Oct/26/add-a-walrus/#diversif...
Again, that's not a GPT-4 thing: that's a ChatGPT interface running GPT-4 with DALL-E as a tool thing.
Such a bizarre take to call this "dystopian".
The model happened to create some out-there pictures. I mean, it's no more outlandish than giant dragons and snakes and such being created, yet the thought of a person of color appearing in something historically inaccurate triggers this massive outcry about revisionism? Who cares?
Besides, the article identifies the probable goal, which was to counteract well-known biases in existing models (i.e. when generating "angry person" you mainly got black people). Clearly this one wasn't tuned well for that goal, but the objective is not only noble but absolutely should be required for anyone producing LLMs.
Right, "who cares" about the truth in our dystopian world? 1984 is apparently too long ago for people to remember the ministry of truth...
If I may explain: the dystopian part to me is the lack of transparency around training code, training data sources, tuning, meta prompting, and so forth. In Google’s case, they’re a large corporation that controls how much of society accesses information. If they’re secretly curating what that information is, rather than presenting it as neutrally as they can, it does feel dystopian to me. I’d like transparency as a consumer of information, so I know, to the extent possible, what the sources of information were or how I am being manipulated by choices the humans building these systems made.
I appreciate the issue you’re drawing attention to in the example you shared about images of an angry person. I think I agree that focused tuning for situations like that might be noble and I would be okay with a model correcting for that specific example you shared. But I also struggle with how to clearly draw that line where such tuning may go too far, which is why I favor less manual biasing. But I disagree that such tuning should be required, if you meant required by the law. Like with speech or art in general, I think anyone should be able to produce software systems that generate controversial or offensive speech or art. Individual consumers can choose what they want to interact with, and reject LLMs that don’t meet their personal standards.
One thing I wanted to add and call attention to is the importance of licensing in open models. This is often overlooked when we blindly accept the vague branding of models as “open”, but I am noticing that many open weight models are actually using encumbered proprietary licenses rather than standard, OSI-approved open source licenses (https://opensource.org/licenses). As an example, Databricks’s DBRX model has a proprietary license that forces adherence to their highly restrictive Acceptable Use Policy by referencing a live website hosting their AUP (https://github.com/databricks/dbrx/blob/main/LICENSE), which means that as they change their AUP, you may be further restricted in the future. Meta’s Llama is similar (https://github.com/meta-llama/llama/blob/main/LICENSE). I’m not sure who can depend on these models given this flaw.
Do we even know if these licenses are binding? AFAIK we have no ruling on whether model weights are even eligible for copyright. They're machine-produced derivatives of other work, so it's not a guarantee that copyright protects them.
That’s a great point, and I hope more people speak up in favor of treating models as mere numerical derivative works, so they aren’t automatically granted these protections. It’s better if society meaningfully debates this and chooses the right approach.
Since when? I’ve had the complete opposite experience.