
Hello OLMo: A truly open LLM

blackeyeblitzar
16 replies
21h1m

This is the only LLM that is exciting to me. Clearly, LLMs are powerful tools that may end up replacing search and even go much further than simple searches by performing the research for you and producing final answers. Closed models like those from OpenAI (ironically) or Anthropic cannot be audited. When most users end up blindly hitting Microsoft’s Copilot button, which Microsoft is forcing OEMs to adopt, who’s to say how the information a user gets is being curated or manipulated by OpenAI or Microsoft or whoever?

We’ve already seen real world examples of severe bias injected into LLMs. For example, Google’s Gemini had secret meta prompts that biased it towards certain types of answers and also caused it to produce hallucinated images that were funny but also dystopian (https://arstechnica.com/information-technology/2024/02/googl...). I don’t think we can just let closed AI systems take over society when they can easily be manipulated by the model owners without transparency.

What I like about AI2’s approach with OLMo is that they are actually open, not just trading on the marketing benefits of the word “open”. Most “open” models are just open weights, not open source. That’s like sharing an executable and not the source code. In my view, being open means that others must be able to reproduce the final product (the model) if they want to and have the means (in terms of training hardware). It also means that they should be able to use whatever is provided freely for any purpose, rather than being subject to proprietary licensing. AI2 shares the training source code, training data, evaluation suite, and the model weights that they’ve produced by running the training process. It all uses the Apache license. And it’s also interesting that they used AMD hardware to train this LLM rather than Nvidia/CUDA.

Open weight models like Llama keep catching up to the best closed models from OpenAI or Anthropic or others. My hope is that truly open models like OLMo keep developing quickly enough to also keep up. Lastly, I hope that regulation does not block open source private development of AI systems. These systems will be the vehicle for speech for much of society in the future, so blocking private AI systems is a lot like restricting speech. But leaving that aside, open development will also drive innovation, and reducing competitive pressure will hurt it.

simonw
8 replies
20h14m

Pet peeve: Google's Gemini LLM was not to blame for the image generation weirdness.

That would be like blaming DALL-E weirdness on GPT-4.

Unfortunately, Google marketing decided to slap the "Gemini" brand on both the end-user interface used to interact with the model AND the actual model itself, hence people constantly calling out Gemini-the-model for weird decisions made as part of Gemini-the-user-interface.

espadrine
3 replies
8h52m

> Google's Gemini LLM was not to blame for the image generation weirdness. That would be like blaming DALL-E weirdness on GPT-4.

The way I read the Gemini technical report, it seemed like, unlike GPT-4 vs DALL-E, Gemini was pretrained with multimodal outputs. Is that not the case?

simonw
2 replies
4h36m

Is that right? I didn't think Gemini was generating images directly, I assumed it was using a separate image generation tool.

The paper here https://arxiv.org/pdf/2403.05530.pdf has a model card for Gemini 1.5 Pro that says:

    Output(s): Generated text in response to the input
    (e.g., an answer to the question, a summary of
    multiple documents, comparing documents/videos).

espadrine
1 replies
4h18m

Huh, that is true in both the model cards of Gemini 1.5 Pro and Gemini 1.0.

That feels like it runs counter to this statement from the Gemini 1.0 technical report[0]:

> Gemini models are trained to accommodate textual input interleaved with a wide variety of audio and visual inputs, such as natural images, charts, screenshots, PDFs, and videos, and they can produce text and image outputs

[0]: https://arxiv.org/pdf/2312.11805.pdf

simonw
0 replies
3h33m

Yeah, what does that bit about "image outputs" mean, I wonder?

yk
1 replies
17h35m

Did anybody manage to get the entire prompt out of Gemini, or what are you basing your claim on?

simonw
0 replies
4h36m

That's my point. The system prompt isn't part of the model - it's part of the UI system that wraps the model.

michaelt
1 replies
10h48m

> That would be like blaming DALL-E weirdness on GPT-4.

Actually, when you trigger DALL-E through GPT-4 (i.e. with the LLM generating the prompt to give the diffusion model, then returning the resulting image to the user), the LLM's system instructions [1] say "7. Diversify depictions of ALL images with people to always include always DESCENT and GENDER for EACH person using direct terms." and a bunch of stuff along those lines.

In OpenAI's system this doesn't always trigger; if the user asks for an image of trash being collected, the user hasn't explicitly asked for any people to be depicted, so the LLM doesn't find anything in the prompt that needs diversity added. The trash-being-collected prompt gets passed to DALL-E unmodified, and the resulting image has all male workers.

[1] https://raw.githubusercontent.com/spdustin/ChatGPT-AutoExper...
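
To make that flow concrete, here is a purely hypothetical sketch of the wrapper pattern described above: a chat model applies its system instructions to the user's request before a separate image model ever sees the prompt. Every name in it (StubLLM, StubImageModel, the keyword-based rewrite rule) is invented for illustration and is not OpenAI's actual code.

    # Hypothetical illustration of an LLM-wrapped image generator (not OpenAI's code).
    # The wrapper LLM decides, per its system instructions, whether to rewrite the
    # user's prompt before handing it to the image model.

    SYSTEM = "Diversify depictions of images that contain people."

    class StubLLM:
        """Stands in for the wrapper chat model (GPT-4 in OpenAI's stack)."""
        def rewrite(self, system: str, user: str) -> str:
            # Crude stand-in for the model's judgement: only rewrite when the
            # request obviously involves people.
            if any(word in user.lower() for word in ("person", "people", "worker")):
                return user + ", depicting people of varied descent and gender"
            return user  # e.g. "trash being collected" passes through unmodified

    class StubImageModel:
        """Stands in for the diffusion model (DALL-E in OpenAI's stack)."""
        def generate(self, prompt: str) -> str:
            return f"<image for: {prompt}>"

    def generate_image(llm: StubLLM, image_model: StubImageModel, user_prompt: str) -> str:
        return image_model.generate(llm.rewrite(SYSTEM, user_prompt))

    print(generate_image(StubLLM(), StubImageModel(), "trash being collected"))
    print(generate_image(StubLLM(), StubImageModel(), "workers collecting trash"))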

gremlinunderway
2 replies
20h22m

> For example, Google’s Gemini had secret meta prompts that biased it towards certain types of answers and also caused it to produce hallucinated images that were funny but also dystopian (https://arstechnica.com/information-technology/2024/02/googl...).

Such a bizarre take to call this "dystopian".

The model happened to create some out-there pictures. I mean, it's no more outlandish than giant dragons and snakes and such being created, yet the thought of a person of color appearing as something historically inaccurate triggers this massive outcry against revisionism? Who cares?

Besides, the article identifies the probable goal, which was to eliminate well-known biases in existing models (e.g. when generating "angry person" you mainly got black people). Clearly this one wasn't tuned well for that goal, but the objective is not only noble but absolutely should be required for anyone producing LLMs.

lynx23
0 replies
12h27m

Right, "who cares" about the truth in our dystopian world? 1984 is apparently too long ago for people to remember the Ministry of Truth...

blackeyeblitzar
0 replies
20h7m

If I may explain: the dystopian part to me is the lack of transparency around training code, training data sources, tuning, meta prompting, and so forth. In Google’s case, they’re a large corporation that controls how much of society accesses information. If they’re secretly curating what that information is, rather than presenting it as neutrally as they can, it does feel dystopian to me. I’d like transparency as a consumer of information, so I know to the extent possible, what the sources of information were or how I am being manipulated by choices the humans building these systems made.

I appreciate the issue you’re drawing attention to in the example you shared about images of an angry person. I think I agree that focused tuning for situations like that might be noble and I would be okay with a model correcting for that specific example you shared. But I also struggle with how to clearly draw that line where such tuning may go too far, which is why I favor less manual biasing. But I disagree that such tuning should be required, if you meant required by the law. Like with speech or art in general, I think anyone should be able to produce software systems that generate controversial or offensive speech or art. Individual consumers can choose what they want to interact with, and reject LLMs that don’t meet their personal standards.

blackeyeblitzar
2 replies
20h48m

One thing I wanted to add and call attention to is the importance of licensing in open models. This is often overlooked when we blindly accept the vague branding of models as “open”, but I am noticing that many open weight models are actually using encumbered proprietary licenses rather than standard open source licenses that are OSI approved (https://opensource.org/licenses). As an example, Databricks’s DBRX model has a proprietary license that forces adherence to their highly restrictive Acceptable Use Policy by referencing a live website hosting their AUP (https://github.com/databricks/dbrx/blob/main/LICENSE), which means as they change their AUP, you may be further restricted in the future. Meta’s Llama is similar (https://github.com/meta-llama/llama/blob/main/LICENSE ). I’m not sure who can depend on these models given this flaw.

idle_zealot
1 replies
19h13m

Do we even know if these licenses are binding? AFAIK we have no ruling on whether model weights are even eligible for copyright. They're machine-produced derivatives of other work, so it's not a guarantee that copyright protects them.

blackeyeblitzar
0 replies
18h11m

That’s a great point and I hope more people speak up to treat models as just numerical derivative works so they aren’t automatically granted these protections. It’s better if society meaningfully debates this and chooses the right approach.

theshackleford
0 replies
13h48m

> Open weight models like Llama keep catching up to the best closed models from OpenAI or Anthropic or others.

Since when? I’ve had the complete opposite experience.

Havoc
9 replies
20h46m

Notably “The Pile” doesn’t seem to be part of the training data. So this might be more sound legally than many other “open” LLMs.

sgu999
8 replies
20h41m

For those also wondering: https://pile.eleuther.ai

> The Pile is a 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

But what's the legal complication with it?

simonw
5 replies
20h29m

It is absolutely packed with unlicensed, copyrighted data.

Books3 is the most notable example - nearly 200,000 pirated ebooks - but a lot of the rest of it is (unlicensed) scraped web data.

The legal questions over whether this is a problem are currently still unresolved. Many people are also bothered by the ethical implications, which is a separate issue from the legal questions.

23B1
4 replies
18h8m

Ironic that even our everyday governance has little 'Alignment' between ethics and law.

ben_w
1 replies
6h3m

We wouldn't need lawyers if all the rules could be expressed as "be ethical".

23B1
0 replies
22m

The lawyers certainly agree with you on that!

jacobn
0 replies
15h1m

Ethics are a lot more nuanced and change a lot faster than laws.

Heck, a large fraction of ethics seems to be so fickle that it’s subject to potential revision by every generation.

In fact, I’d argue that those revisions are a significant portion of how one generation distinguishes itself from their parents.

Yet strangely every generation feels like they have arrived at a set of “universal laws” in their ethics.

KarlKemp
0 replies
14h18m

In this case, both ethics and the law are murky.

Pretty excellent alignment, for once?

codazoda
0 replies
18h42m

I took a quick peek at this last time it was mentioned and it had dozens of my own repos of unlicensed source code in it. All of that was published on GitHub and made public, but much of it has no license specified.

blackeyeblitzar
0 replies
20h38m

It received DMCA takedowns: https://en.wikipedia.org/wiki/The_Pile_(dataset)

> The Books3 component of the dataset contains copyrighted material compiled from Bibliotik, a pirate website. In July 2023, the Rights Alliance took copies of The Pile down through DMCA notices. Users responded by creating copies of The Pile with the offending content removed.

vjeux
8 replies
18h43m

If I read the license correctly, it seems that if you want to use the LLM, you need to tell the authors what you are doing with it.

Am I reading this correctly? https://allenai.org/licenses/impact-mr

“Derivative Impact Reports. AI2 seeks to encourage transparency around Derivatives through the use of Derivative Impact Reports, available here. Before releasing a Model Derivative or Data Derivative, You will share with AI2 the intended use(s) of Your Derivative by completing a Derivative Impact Report or otherwise providing AI2 with substantially similar information in writing. You agree that AI2 may publish, post, or make available such information about Your Derivative for review by the general public.

You will use good faith efforts to be transparent about the intended use(s) of Your Derivatives by making the information freely available to others who may access or use Your Derivatives. You acknowledge that Derivative Impact Reports are not intended to penalize any good faith disclosures about Derivatives. Accordingly, if You initiate or participate in any lawsuit or other legal action against a Third Party based on information in such Third Party’s Derivative Impact Report, then this MR Agreement will terminate immediately as of the date such lawsuit or legal action is filed or commenced.”

whimsicalism
3 replies
18h14m

No, this is Apache-licensed. Yes, it is confusing that AI2 has custom licenses, but they aren't using them here.

6gvONxR4sf7o
1 replies
16h19m

Is the license not transitive? Like, could your impact report be "I want to remove this part of the license"?

gardnr
0 replies
9h14m

I like the way you think but 2b might prevent that.

mkl
0 replies
18h23m

Does that apply to this model? On huggingface it says "License: The code and model are released under Apache 2.0."

jrm4
0 replies
3h34m

Weird. So even if these things are well intentioned, seems like they don't have any teeth.

Are there any out there that have licenses which are (dare I say) simpler, like the GPL?

blackeyeblitzar
0 replies
18h26m

Interesting. I recall seeing Apache licenses in their official repositories. I wonder how these additional restrictions get pulled in.

Chris2048
0 replies
7h42m

> if You initiate or participate in any lawsuit or other legal action ... this MR Agreement will terminate immediately

Is this legal? Restricting legal options by making an agreement dependent on it?

arcza
5 replies
8h11m

sToP bLOgGinG wITh Medium!

egKYzyXeIL
2 replies
7h6m

Why shouldn't people use Medium? I'm probably out of the loop.

gadflyinyoureye
0 replies
7h3m

They often require a login to see the whole article. Later they cap your access to articles at N per some period of time. The only way around that is to purchase a subscription. Given the weak offering of Medium, it’s seldom worth the $/month cost of a subscription for the few jewels that might appear.

arcza
0 replies
7h5m

The nags, the dark patterns, the horrific UI, the soft paywalls, and the tracking, to name a few reasons

barfbagginus
0 replies
7h5m

sToP bLOgGinG wITh Medium!

pksebben
4 replies
14h19m

It's odd. Running inference on this (and other models in its class), I keep running into a "repeating token" situation with moderate-to-long context windows.

It feels almost as if, during inference, the model hits some kind of local minimum that it careens around, and while temperature seems to affect this, it doesn't really fix it.

at temp 0.2:

[{'generated_text': 'What follows is a transcript of a talk between a mysterious man and an agent of a bureau dedicated to investigating things which is typically referred to by some assortment of letters in the alphabet. The identity, origins, and motivations of the man were not known then and remain so. This transcript is not meant to scare, but provided simply to enlighten the concerned citizen of all the various and sundry things that may or may not go bump in the night. AGENT: Please state your name for the record. MYSTERIOUS STRANGER: I am the man. AGENT: Thank you. I am an agent of the Bureau of Investigation. I am here to investigate the following: 1. The following: 2. The following: 3. The following: 4. The following: 5. The following: 6. The following: 7. The following: 8. The following: 9. The following: 10. The following: 11. The following: 12. The following: 13. The following: 14. The following: 15. The following: 16. The following: 17. The following: 18. The following: 19. The following: 20. The following: 21. The following: 22. The following: 23. The following: 24. The following'}]

...and at temp 0.4:

[{'generated_text': 'What follows is a transcript of a talk between a mysterious man and an agent of a bureau dedicated to investigating things which is typically referred to by some assortment of letters in the alphabet. The identity, origins, and motivations of the man were not known then and remain so. This transcript is not meant to scare, but provided simply to enlighten the concerned citizen of all the various and sundry things that may or may not go bump in the night. AGENT: Please state your name for the record. MYSTERIOUS STRANGER: My name is not important. AGENT: My name is Agent Cyanide. MYSTERIOUS STRANGER: Agent Cyanide. AGENT: I am an agent of the Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations. AGENT: The Bureau of Investigations. MYSTERIOUS STRANGER: The Bureau of Investigations'}]

pksebben
2 replies
14h12m

... this can get a little goofy even with do_sample=False and no temp:

| [{'generated_text': "DAUGHTER: tell me a story FATHER: but it's late DAUGHTER: please? FATHER: okay, once upon a time there was a little girl who lived in a little house with her mother and father and her brother and sister and her dog and her cat and her hamster and her fish and her bird and her rabbit and her horse and her cow and her sheep and her goat and her pig and her chicken and her duck and her turkey and her goose and her llama and her alpaca and her camel and her zebra and her giraffe and her elephant and her hippopotamus and her rhinoceros and her kangaroo and her koala and her panda and her bear and her wolf and her fox and her cat and her dog and her bird and her fish and her hamster and her cat and her dog and her bird and her fish and her hamster and her cat and her dog and her bird and her fish and her hamster and her cat and her dog and her bird and her fish and her hamster and"}]
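
For anyone trying to reproduce or work around this, here is a minimal sketch using a repetition penalty and an n-gram block, the usual generation knobs for damping loops like these. The `allenai/OLMo-7B` checkpoint id and the specific parameter values are my assumptions rather than pksebben's actual setup, and older transformers versions needed the `hf_olmo` package plus `trust_remote_code=True` to load OLMo at all.

    from transformers import pipeline

    # Assumed checkpoint id; pass trust_remote_code=True on older transformers.
    generator = pipeline("text-generation", model="allenai/OLMo-7B")

    prompt = "DAUGHTER: tell me a story FATHER: but it's late DAUGHTER: please?"

    out = generator(
        prompt,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.4,
        repetition_penalty=1.2,    # down-weight tokens that already appeared
        no_repeat_ngram_size=4,    # forbid exact 4-gram repeats outright
    )
    print(out[0]["generated_text"])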

gpderetta
1 replies
10h20m

That seems like a perfect story to put a little child to bed :D.

I have used a similar recursive story in the past. My son still jokes about it.

fho
0 replies
9h13m

There actually was a podcast around that concept when (I think) GPT-2 was current.

Basically one generated story per day. Absurd in places.

polygamous_bat
0 replies
9h26m

From what I heard through the grapevine, OLMo is not nearly the best model for its size or compute budget. Apparently something didn’t quite go right and AI2 didn’t have the money to train until they got it right.

wg0
2 replies
10h43m

The hype around LLMs won't last past 2030, I suppose. With LLMs we have a statistical inference soup that goes stale like stagnant pond water, becoming less accurate with each passing day.

I am curious how long the hype wave lasts. One I saw recently was K8s; it settled down and won, TBH.

michaelmior
0 replies
9h11m

The transformer architecture probably won't last and we might start calling them something else, but I can't see something that could reasonably be called an LLM going away any time soon.

Grimblewald
0 replies
9h12m

I think the hype dies down and they'll become part of a bigger thing, like dense neural networks.

mysteria
2 replies
20h24m

Is this one of the first LLMs of note that was successfully trained on AMD GPUs? I wonder how seamless the process was and if they faced any issues there.

sanxiyn
0 replies
19h42m

Databricks (who also participated in OLMo; it's probably the same codebase) trained on AMD before, see their 2023 post https://www.databricks.com/blog/amd-mi250. It was probably seamless, as any issues were fixed by Databricks in 2023.

lostmsu
2 replies
20h13m

Too bad they did not put any comparison tables into the blog post.

polygamous_bat
0 replies
9h24m

I commented this somewhere else, but word in the ether is that OLMo is not actually that good of a model given its size and compute budget. I am not entirely sure why, and it’s still good to have the full recipe for at least one model out in the open, but the current OLMo definitely is a cautionary tale for people training their own model.

timsuchanek
1 replies
19h35m

Great to see e2e openness. One of the only true OSS models out there, versus most models releasing only the binaries (weights). Surprised that they didn’t mention Mistral 7B in the comparisons.

sanxiyn
0 replies
19h31m

Falcon also released an open dataset.

refulgentis
1 replies
19h44m

This is 2 months old.

btbuildem
0 replies
18h47m

And yet it's topical and relevant.

margorczynski
1 replies
9h46m

> 1. No biases. Following LLaMA, PaLM, and others, we exclude all bias terms from our architecture in order to improve training stability.

What does this mean? What is a "bias term"?

polygamous_bat
0 replies
9h30m

Think of the term b in y = Wx + b. W is called the weight, b is called the bias.
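
As a quick illustration in PyTorch (the 4096 width is just an arbitrary example size, not necessarily what OLMo uses):

    import torch.nn as nn

    # A standard linear layer computes y = Wx + b and learns both W and b.
    with_bias = nn.Linear(4096, 4096)
    print(with_bias.bias.shape)   # torch.Size([4096])

    # "No biases" means every such layer is built without the b term,
    # so it computes y = Wx only.
    no_bias = nn.Linear(4096, 4096, bias=False)
    print(no_bias.bias)           # None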

timmg
0 replies
20h47m

Has their site been hugged-to-death or is it my hotel wifi?

kikoreis
0 replies
16h39m

What does the risk classification applied to the dataset actually mean? The licensing page [1] AI2 provides for their datasets is really nice, but it doesn’t really explain [2] what risk means in this context.

Does it mean "risk that the items contained in this set are licensed in a manner incompatible with its use in a training dataset"?

[1] https://allenai.org/impact-license

[2] "the AI2 ImpACT Licenses are artifact-agnostic and are instead structured according to the risk level we’ve assigned a given artifact"

ein0p
0 replies
12h39m

Seems to be surprisingly fast at smaller sizes, too.