This is the only LLM that is exciting to me. Clearly, LLMs are powerful tools that may end up replacing search, and they may go much further than simple search by performing the research for you and producing final answers. Closed models like those from OpenAI (ironically) or Anthropic cannot be audited. When most users end up blindly hitting Microsoft’s Copilot button, which Microsoft is forcing OEMs to adopt, who’s to say how the information a user gets is being curated or manipulated by OpenAI or Microsoft or whoever?
We’ve already seen real-world examples of severe bias being injected into LLMs. For example, Google’s Gemini had secret meta prompts that biased it toward certain types of answers and also caused it to produce hallucinated images that were funny but also dystopian (https://arstechnica.com/information-technology/2024/02/googl...). I don’t think we can just let closed AI systems take over society when they can so easily be manipulated by the model owners without transparency.
What I like about AI2’s approach with OLMo is that they are actually open, not just trading on the marketing benefits of the word “open”. Most “open” models are only open weights, not open source. That’s like sharing an executable but not the source code. In my view, being open means that others must be able to reproduce the final product (the model) if they wanted to and had the means (in terms of training hardware). It also means that they should be able to use whatever is provided freely, for any purpose, rather than being subject to proprietary licensing. AI2 shares the training source code, the training data, the evaluation suite, and the model weights produced by running the training process, all under the Apache license. It’s also interesting that they used AMD hardware to train this LLM rather than Nvidia/CUDA.
Open weight models like Llama keep catching up to the best closed models from OpenAI, Anthropic, and others. My hope is that truly open models like OLMo keep developing quickly enough to keep up as well. Lastly, I hope that regulation does not block open source and private development of AI systems. These systems will be the vehicle for speech for much of society in the future, so blocking private AI systems is a lot like restricting speech. But leaving that aside, open development will also drive innovation, and reducing competitive pressure will hurt innovation.
Pet peeve: Google's Gemini LLM was not to blame for the image generation weirdness.
That would be like blaming DALL-E weirdness on GPT-4.
Unfortunately, Google marketing decided to slap the "Gemini" brand on both the end-user interface used to interact with the model AND the actual model itself, hence people constantly calling out Gemini-the-model for weird decisions made as part of Gemini-the-user-interface.
The way I read the Gemini technical report, it seemed like, unlike the GPT-4/DALL-E split, Gemini was pretrained with multimodal outputs. Is that not the case?
Is that right? I didn't think Gemini was generating images directly, I assumed it was using a separate image generation tool.
The paper here https://arxiv.org/pdf/2403.05530.pdf has a model card for Gemini 1.5 Pro that says:
Huh, that is true in the model cards of both Gemini 1.5 Pro and Gemini 1.0.
That feels like it runs counter to this statement from the Gemini 1.0 technical report[0]:
[0]: https://arxiv.org/pdf/2312.11805.pdf
Yeah, what does that bit about "image outputs" mean, I wonder?
Did anybody manage to get the entire prompt out of Gemini, or what are you basing your claim on?
That's my point. The system prompt isn't part of the model - it's part of the UI system that wraps the model.
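To make that distinction concrete, here's a minimal sketch, assuming a generic chat-completions-style request format (the payload shape and model name are illustrative, not any particular vendor's API). The system prompt is just text the wrapping product attaches to every request; the weights behind the API never change.

    # Illustrative only: the system prompt travels with each request;
    # the model weights behind the API stay exactly the same.
    def build_request(question, system_prompt=None):
        messages = []
        if system_prompt:
            # Injected by the wrapping UI/product, not baked into the model.
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": question})
        return {"model": "some-chat-model", "messages": messages}

    # Same model (same weights), different behavior, purely because the
    # wrapper added its own instructions:
    print(build_request("Generate an image of a medieval knight"))
    print(build_request("Generate an image of a medieval knight",
                        system_prompt="Always diversify depictions of people."))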
> That would be like blaming DALL-E weirdness on GPT-4.
Actually, when you trigger DALL-E through GPT-4 (i.e. with the LLM generating the prompt to give the diffusion model, then returning the resulting image to the user), the LLM's system instructions [1] say "7. Diversify depictions of ALL images with people to always include always DESCENT and GENDER for EACH person using direct terms." and a bunch of other stuff along those lines.
In OpenAI's system this doesn't always trigger; if the user asks for an image of trash being collected, they haven't explicitly asked for any people to be depicted, so the LLM doesn't find anything in the prompt that needs diversity added. The trash-being-collected prompt gets passed to DALL-E unmodified, and the resulting image has all-male workers.
[1] https://raw.githubusercontent.com/spdustin/ChatGPT-AutoExper...
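To illustrate the shape of that pipeline (the people check and the rewrite rule below are made up for the sketch, not OpenAI's actual logic), here's roughly what the LLM-in-the-middle step looks like:

    # Sketch of the flow described above: the LLM sits between the user and
    # the image model and may rewrite the prompt per its system instructions.
    PEOPLE_WORDS = {"person", "people", "man", "woman", "worker", "workers"}

    def llm_rewrite_prompt(user_prompt):
        # Stand-in for the GPT-4 step: it only touches prompts that
        # explicitly mention people.
        if PEOPLE_WORDS & set(user_prompt.lower().split()):
            # Crude stand-in for the "diversify depictions" instruction.
            return user_prompt + ", with people of varied descent and gender"
        return user_prompt

    def prompt_sent_to_image_model(user_prompt):
        # Stand-in for handing the (possibly rewritten) prompt to DALL-E.
        return llm_rewrite_prompt(user_prompt)

    print(prompt_sent_to_image_model("trash being collected"))  # passes through unmodified
    print(prompt_sent_to_image_model("a group of workers"))     # gets the diversity rewrite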
Yeah, I wrote about that last year: https://simonwillison.net/2023/Oct/26/add-a-walrus/#diversif...
Again, that's not a GPT-4 thing: that's a ChatGPT interface running GPT-4 with DALL-E as a tool thing.
Such a bizarre take to call this "dystopian".
The model happened to create some out-there pictures. I mean, it's no more outlandish than giant dragons and snakes and such being created, yet the thought of a person of color appearing in something historically inaccurate triggers this massive outcry about revisionism? Who cares?
Besides, the article identifies the probable goal, which was to counteract well-known biases in existing models (i.e. when generating "angry person" you mainly got black people). Clearly this one wasn't tuned well for that goal, but the objective is not only noble but absolutely should be required for anyone producing LLMs.
Right, "who cares" about the truth in our dystopian world? 1984 is apparently too long ago for people to remember the ministry of truth...
If I may explain: the dystopian part to me is the lack of transparency around training code, training data sources, tuning, meta prompting, and so forth. In Google’s case, they’re a large corporation that controls how much of society accesses information. If they’re secretly curating what that information is, rather than presenting it as neutrally as they can, it does feel dystopian to me. I’d like transparency as a consumer of information, so I know, to the extent possible, what the sources of information were or how I am being manipulated by choices the humans building these systems made.
I appreciate the issue you’re drawing attention to in the example you shared about images of an angry person. I think I agree that focused tuning for situations like that might be noble and I would be okay with a model correcting for that specific example you shared. But I also struggle with how to clearly draw that line where such tuning may go too far, which is why I favor less manual biasing. But I disagree that such tuning should be required, if you meant required by the law. Like with speech or art in general, I think anyone should be able to produce software systems that generate controversial or offensive speech or art. Individual consumers can choose what they want to interact with, and reject LLMs that don’t meet their personal standards.
One thing I wanted to add and call attention to is the importance of licensing in open models. This is often overlooked when we blindly accept the vague branding of models as “open”, but I am noticing that many open weight models are actually using encumbered proprietary licenses rather than standard, OSI-approved open source licenses (https://opensource.org/licenses). As an example, Databricks’s DBRX model has a proprietary license that forces adherence to their highly restrictive Acceptable Use Policy by referencing a live website hosting their AUP (https://github.com/databricks/dbrx/blob/main/LICENSE), which means that as they change their AUP, you may be further restricted in the future. Meta’s Llama is similar (https://github.com/meta-llama/llama/blob/main/LICENSE). I’m not sure who can depend on these models given this flaw.
Do we even know if these licenses are binding? AFAIK we have no ruling on whether model weights are even eligible for copyright. They're machine-produced derivatives of other work, so it's not a guarantee that copyright protects them.
That’s a great point, and I hope more people speak up in favor of treating models as mere numerical derivative works, so they aren’t automatically granted these protections. It’s better if society meaningfully debates this and chooses the right approach.
Since when? I’ve had the complete opposite experience.