Hello (again) from the Gemma team! We are quite excited to push this release out and happy to answer any questions!
Opinions are our own and not of Google DeepMind.
It's exceptionally strong. In LMSys Chatbot Arena, the 27B version scores above Llama-3-70B, at the level of OpenAI GPT-4 and Claude-3 Sonnet!
If anyone is interested in evaling Gemma locally, this can be done pretty easily using ollama[0] and promptfoo[1] with the following config:
prompts:
  - 'Answer this coding problem in Python: {{ask}}'
providers:
  - ollama:chat:gemma2:9b
  - ollama:chat:llama3:8b
tests:
  - vars:
      ask: function to find the nth fibonacci number
  - vars:
      ask: calculate pi to the nth digit
  - # ...
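If promptfoo is installed (e.g. via npm), running `promptfoo eval` in the directory containing this config should produce a side-by-side comparison, and `promptfoo view` opens the results in a browser; exact flags may vary by version, so treat this as a rough pointer rather than the canonical invocation.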
One small thing I've always appreciated about Gemma is that it doesn't include a "Sure, I can help you" preamble. It just gets right into the code and follows it with an explanation. The training seems to emphasize response structure and ease of comprehension.

Also, it's best to run evals that don't rely on rote memorization of public code... so please substitute with your personal tests :)
In Ollama, Gemma:9b works fine, but 27b seems to be producing a lot of nonsense for me. Asking for a bit of python or JavaScript code rapidly devolves into producing code-like gobbledegook, extending for hundreds of lines.
Had a chance to do some testing and it seems quite good on one-shot tasks with a small context window, but as you approach context saturation it starts to go way off the rails. Maybe this is an implementation issue? I'm using Q6_K quants of both sizes in ollama. I'll report back if I figure it out.
A larger context window really helps on RAG tasks, it's frustrating that a lot of the foundational models have such small windows.
Sorry about this – working on fixing the issue with hitting the context limit. Gemma 2 supports an 8192-token context limit, which can be selected if you provide the `num_ctx` parameter in the API or via `ollama run` with `/set parameter num_ctx 8192`.
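For the API route, here's a minimal sketch (assuming a local Ollama server on its default port and the `gemma2:27b` tag; adjust to whatever you pulled):

```python
# Minimal sketch: ask a local Ollama server for the full 8192-token context
# via the options.num_ctx field. Assumes the default endpoint http://localhost:11434.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma2:27b",
        "messages": [{"role": "user", "content": "Summarize the plot of Hamlet."}],
        "options": {"num_ctx": 8192},  # overrides the model's default context window
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```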
Thanks! If you have a moment can you give me a quick explainer on what happens when you hit the context limit in ollama? I had assumed that ollama would just truncate the context to whatever is set in the model, but I guess this isn't the case?
Currently when the context limit is hit, there's a halving of the context window (or a "context shift") to allow inference to continue – this is helpful for smaller (e.g. 1-2k) context windows.
However, not all models (especially newer ones) respond well to this, which makes sense. We're working on changing the behavior in Ollama's API to be more similar to OpenAI, Anthropic and similar APIs so that when the context limit is hit, the API returns a "limit" finish/done reason. Hope this is helpful!
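For anyone wondering what that looks like in practice, here is a rough sketch of the idea (illustrative only, not Ollama's actual implementation):

```python
# Rough sketch of a "context shift": when the token count exceeds the context
# window, drop the oldest half of the non-protected tokens so generation can
# continue. Illustrative only; the real logic lives in the Ollama/llama.cpp server.
def context_shift(tokens, num_ctx, keep_prefix=0):
    if len(tokens) <= num_ctx:
        return tokens
    prefix = tokens[:keep_prefix]          # e.g. system prompt tokens to preserve
    rest = tokens[keep_prefix:]
    return prefix + rest[len(rest) // 2:]  # discard the oldest half of the rest
```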
The tokenizer in llama.cpp probably needs fixing then or it has some other bug.
27b is working fine for me, hosted on ollama w/ continue.dev in VSCode.
What are the most obvious standouts?
In my experience, smaller models tend to do well on benchmarks and fail at generalization. Phi-2 comes to mind.
It's multilingual. Genuinely. Compared my results with some people on reddit and the consensus is that the 27B is near perfect in a few obscure languages and likely perfect in most common ones. The 9B is not as good but it's still coherent enough to use in a pinch.
It's literally the first omni-translation tool that actually works that you can run offline at home. I'm amazed that Google mentioned absolutely nothing about this in their paper.
Wow, that's very impressive and indeed a game changer. I've previously had trouble with various Scandinavian languages, but the last one I checked was Llama 2, and I kind of gave up on it. I had expected we were going to need special-purpose small models for these uses as a crutch, like SW-GPT3.
So I guess Gemma 2 is going to become Gemini 2.0 in their truly large and closed variants then? Or is it the open version of Gemini 1.5?
I'd encourage people to test for themselves (and to let the Chatbot Arena scores settle) before getting caught up in too much hype. I just did a personal eval and found gemma-2-27b-it (tested on AI Studio) performed far worse in my testing than Llama 3 70B, especially for reasoning and basic world understanding queries.
Same. I tried 27B and found it to be not even close to llama3-70b.
Even llama-8b did better in some of my tests than Gemma 27b.
I also prefer to use "Coding" or "Hard Prompts (Overall)" instead of the default "Overall" in Chatbot Arena scores to determine the actual performance level of LLMs. It seems much more aligned with my vibe test in terms of reasoning. I guess the "Overall" category contains a lot of creative tasks, which is not what I use most in my daily work.
Do we believe that? I've been told Google's AI was going to be great 4 times now, and it's consistently #4 behind OpenAI, Facebook, and Claude.
LMSys Chatbot Arena is a crowd-sourced ranking with an ELO system: basically, users are presented with 2 hidden models, they get the answers of the 2 models for their request, and they vote on which one performed best, which resolves one match and updates the ELO scores. This is the closest thing we have to ground truth for LLM evaluation, and Gemma2-27B performs extremely well in Chatbot Arena ELO.
Just saw this, might get lost in the noise, but just for posterity, apparently the Gemma 2 models were specifically RL’d to index on Chat Arena performance: https://x.com/natolambert/status/1806384821826109597
(Relevant sections of the paper highlighted.)
On prompts only, with answers presumably from the teacher model (Gemini).
It was not trained or RLHFd on Arena replies or user preferences.
I think this is just due to better non-English training data.
It's 15 ELO under Llama-3-70B on English hard prompts and 41 ELO under Llama-3-70B (the latter is actually stat sig) for general English.
I gave up hope on r"Gem[ma|ini]" a long time ago. I don't believe that Google can't produce good LLMs because of its massive company size; Microsoft is also a giant company (more market cap than Google) but it keeps surprising us with the ϕ models.
I think Google just lacks the vision to understand what makes a good LLM. Theoretical contributions by research teams are valuable, but the real-world is built around engineering ideas that may lack the "purity" and elegance of theory but damn it they work.
I wonder if Google is making DeepMind people switch from their cool original research to doing LLMs like everybody else. Having their scale in money and data, I would hire new teams of engineers who want to do LLMs and let the DeepMind researchers do their thing. Don't kill the goose that lays the golden eggs.
Google is in a fight for their lives, I've fully moved over to paid services and haven't used google in about a month now.
If this were a common sentiment or rooted in reality I would imagine their stock would not be at an all time high...
Ironically I was just thinking earlier today how the most valuable Google products to me are YouTube and Android... and that's it.
I gave up on Chrome a decade ago, going back to Firefox. I don't use Google for search anymore, I do use Gmail but I also got Protonmail so could easily migrate the Gmail traffic there.
A lot of non-techies I know have complained for some time how Google search sucks, and while a lot use Chrome it seems to be mainly inertia.
Not saying Google is dying, but it seems vulnerable for disruption.
Is it really possible to even disrupt YouTube? It's been a constant in our lives for the past 20 years and is basically a historical record by now. By a rough estimate, they have to keep buying over 1% of the total world production of HDDs just to stay on top of the new data being uploaded. Google has completely destroyed it, placing more ads than videos on it and making it unusable without an adblocker, and people still use it; it's that core to everyone's lives. It's like a public utility.
I'm an early adopter. The rest of you will catch up in the next five years.
Here's a napkin for when you're finished.
long time ago
This is an incredible statement to make about a field that no one was talking about 24 months ago, a family of SOTA models that didn't exist until 8 months ago, and a family of small local models that didn't exist 6 months ago. But sure, give up hope after the first generation of a model family doesn't impress you.
People seem to forget how incredibly early we are in this whole thing. The fact that so much progress has been made in such a short amount of time should make everyone super excited!
To be fair, LLMs (especially Google LLMs) aren't merely 24 months old. This is part of a long line of models that draw their heritage from BERT and Flan-T5. Google has been at this longer than most, particularly in the field of edge-compute models. This isn't even close to a first-generation model family.
That's not to say this is an insignificant contribution. New models are great, especially when released for free, and it's important for big firms to keep the ball rolling for tech to progress. Though there is also legitimate concern that all LLMs aren't improving as fast as they used to improve, and we may have hit the proverbial bathtub curve of AI progress.
I think there is valid criticism of google for inventing a cool technology only to have the rest of the industry discover its usefulness before them. But to say Gemini 1.0 or OG Gemma aren't first generation models because BERT and flan existed before is like saying the iPad wasn't a first generation device because Apple made the Newton. Like sure, they're the same in that they're transformers trained on language and text, but these are new families of models. The training mechanisms are different, their architectures are different, the data sets are different, the intended purpose of the models are completely different, etc. At some point I guess it's a semantic difference, maybe.
Maybe you gave up before Google released Gemini Advanced? This viewpoint seemed more accurate before it was released, but Gemini Advanced is the third best LLM as rated here [1]. In fact, it held second place until a few days ago when Claude 3.5 came out.
[1]: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
Isn't Gemini Advanced Gemini Pro attached to some sort of an internet search program? If it has that advantage over other models it isn't a sign of AI chops.
Can't speak to Gemma, but I found 1.5 superior to Claude and ChatGPT 4 when it came out. The trend seems to be each taking the lead when it comes out, being king of the hill for a couple weeks, and then being surpassed by the next.
Claude's reign has begun, and I'd say it has a solid enough lead for at least another two weeks of dominance before it's dethroned.
And the training samples are overly tied to Vertex
So it's twice the size of phi 3 and considerably worse? What am I missing
Worse in some aspects, better in others.
Small models are never going to be generalists, so having several small models allows you to pick the one that best fits your needs.
When would you use which?
Obviously another small model would be specialized in determining that.
Is it models all the way down?
Whichever model works better for your use. It's hard to know without testing it at the moment.
I've found Gemini to be better at some use-cases, and GPT-4 better at others for my specific taste and use-case. You can kind of go by the benchmark scores to have an idea if it's good at logic, creativity, etc.
Phi-3 does well in benchmarks but underperforms IRL; for example, Phi-3-Medium gets beaten badly by Llama-3-8b on the LMSYS Chatbot Arena despite doing better on benchmarks.
Gemma's performance if anything seems understated on benchmarks: the 27b is currently ahead of Llama3-70b on the Chatbot Arena leaderboard.
I suspect Phi-3 is not robust to normal human input like typos and strange grammar since it's only trained on filtered "high quality" tokens and synthetic data. Since it doesn't need to waste a ton of parameters learning how to error correct input, it's much smarter on well curated benchmarks compared to its weight class. However, it can't operate out of distribution at all.
Personally vibe checking Phi-3-Medium is worse in my experience, no matter how well you spell — it just isn't good at all compared to Llama3-8b, despite being significantly larger in param count. I suspect the "high quality tokens" were "high quality" in the sense that they resembled tokens one might encounter in benchmarks, and not "high quality" in the sense of representing human-like input/output.
Why not try it here and make your comparisons that way? https://aistudio.google.com/app/prompts/new_chat?model=gemma...
One compelling reason not to would be a region block... [0]
Another take on this: phi-3 small has 1100 ELO on LMSYS (ranked #52) while the confidence interval for Gemma 2 9B is [1170, 1200] ELO (ranked btw #15 and #25).
They used two non-mutually exclusive techniques. Phi-3 is mostly a curriculum training breakthrough. By filtering training set for high quality tokens and training on synthetic data, they were able to achieve great results. Gemma-2 is a distillation breakthrough. By training LLMs with guidance from larger teacher LLMs, they were able to achieve great results too.
¿Por qué no los dos? (Why not both?)
Have you tried Phi 3? It's smart which makes it perform well on benchmarks, but it's not great at conversation or as a chatbot.
I imagine Gemma 2 is a better general-purpose assistant for most people, whereas Phi 3 is a solid small LLM (SLM?) for more specific use-cases like summarization, RAG, learning about math and stuff.
Table 4 | Relevant formatting control tokens used for Gemma models
User turn: user
Model turn: model
Start of conversation turn: <start_of_turn>
End of conversation turn: <end_of_turn>
Beginning of sequence: <bos>
End of sequence: <eos>
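Put together, a single-turn prompt in this format looks roughly like the sketch below (whitespace per my reading of the docs; the model's official chat template is the authoritative source):

```python
# Assembling a Gemma-style prompt from the control tokens listed above.
# Exact newline placement should follow the official chat template.
prompt = (
    "<bos>"
    "<start_of_turn>user\n"
    "Write a haiku about context windows.<end_of_turn>\n"
    "<start_of_turn>model\n"
)
# Generation is then expected to stop at <end_of_turn> / <eos>.
```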
You know I keep wondering why <bos> and <eos> tokens are even a thing in general. No model is tuned to keep generating multiple turns after its <end_of_turn> equivalent is sent, and what's the point of <bos> when you're parsing the entire context anyway. If it's an attempt to ignore text before it... then why is that text there? Just remove it from context, you're throwing away compute.
Your training input has the shape of (sequence length x batch size). If a lot of your samples are shorter than sequence length, as is usually the case, you will have a lot of padding tokens in the input, which is wasted compute.
To compensate for that, you can pack multiple examples in the same sequence. This is where EOS and BOS come in, as they indicate to the model that the two parts of the sequence are not related.
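A toy illustration of that packing (greedy packing of integer token ids; real pipelines also build a matching attention mask and label mask):

```python
# Toy sketch: pack several short examples into fixed-length rows, separating
# them with BOS/EOS so the model can tell where one document ends and the next begins.
BOS, EOS, PAD = 1, 2, 0

def pack(examples, seq_len=16):
    rows, row = [], []
    for ex in examples:
        piece = [BOS] + ex + [EOS]
        if len(row) + len(piece) > seq_len:
            rows.append(row + [PAD] * (seq_len - len(row)))
            row = []
        row += piece
    if row:
        rows.append(row + [PAD] * (seq_len - len(row)))
    return rows

print(pack([[5, 6, 7], [8, 9], [10, 11, 12, 13]]))
```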
You can just do that by shaping the attention mask, no? That also gives you an actual guarantee that no information is leaked between conversations.
In practice, and at scale, that's exactly what having <bos> and <eos> tokens allow you to easily and programmatically do.
You can't pack multiple examples into a single row of a matrix without knowing where one begins and one ends.
think about training.
I suppose it would act as a concrete separator when instruct tuning, but lots of prompt templates don't use it, especially older ones like Alpaca. Maybe it leads to more overall coherence?
Not instruct tuning, you use it in general training.
If you have a bunch of small prompts/answers, you can fit them into bigger batches if you use start/stop tokens.
The knowledge distillation is very interesting but generating trillions of outputs from a large teacher model seems insanely expensive. Is this really more cost efficient than just using that compute instead for training your model with more data/more epochs?
I'm also curious. It seems like 6 months ago everyone was afraid of "model collapse" but now synthetic training generation and teacher models are all the rage. Have we solved the problem of model collapse?
Model collapse was basically a coping idea made up by artists who were hoping AI image generators would all magically destroy themselves at some point; I don't think it was ever considered likely to happen.
It does seem to be true that clean data works better than low quality data.
You're confusing it with data poisoning.
Model collapse itself is (was?) a fairly serious research topic: https://arxiv.org/abs/2305.17493
We've by now reached a "probably not inevitable" - https://arxiv.org/abs/2404.01413 argues there's a finite upper bound to error - but I'd also point out that that paper assumes training data cardinality increases with the number of training generations and is strictly accumulative.
To a first order, that means you better have a pre-2022 dataset to get started, and have archived it well.
but it's probably fair to say current SOTA is still more or less "it's neither impossible nor inevitable".
Oh, no, they definitely believe both are going to happen and ChatGPT is just going to stop working because it'll see itself on the internet. It goes with the common belief that LLMs learn from what you type into them.
To a first order, that means you better have a pre-2022 dataset to get started, and have archived it well.
I think that will always be available, or at least, a dataset with the distribution you want will be available.
Pay attention, because you only get one chance to watch humans learn, in real time, that they are nothing special.
Historically, similar things happened with heliocentrism and evolution, but I guess we weren't there to see it.
The distillation is done on-policy like RLHF -- the student model is generating the sequences and teacher is providing feedback in terms of logits.
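A minimal sketch of that kind of objective, with hypothetical `student`/`teacher` causal LMs (the actual Gemma training setup is only described at a high level in the paper):

```python
# Sketch: on-policy distillation as KL(teacher || student) over next-token
# distributions, computed on sequences sampled from the student itself.
# `student` and `teacher` are hypothetical HF-style causal LMs returning .logits.
import torch
import torch.nn.functional as F

def distill_loss(student, teacher, prompt_ids, max_new_tokens=64):
    with torch.no_grad():
        seq = student.generate(prompt_ids, max_new_tokens=max_new_tokens)
        teacher_logp = F.log_softmax(teacher(seq).logits, dim=-1)
    student_logp = F.log_softmax(student(seq).logits, dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(student_logp, teacher_logp, log_target=True, reduction="batchmean")
```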
This is a great release! If you are looking to try it locally with a great interface, I am working on an app [1] and I just pushed an update to support Gemma2.
Looks cool even though closed source makes me wary.
Trying to save an Anthropic API key on Arch Linux doesn't do anything, and there's a message "If you're experiencing problems saving API keys especially on Linux, contact Discord". If it's such a common problem, maybe you should have a link with possible fixes? Adding another Discord server and searching for answers to a question that has clearly been asked often enough feels like quite a hurdle for testing it out.
What does closed source mean in this context? The weights are open and the model architecture has to be open for people to use it for inference.
I think he was referring to Msty which is closed-source
Wow, msty looks really cool. I've bookmarked it to look into more later as a replacement for how I use a locally-hosted instance of LibreChat. It'd be a huge improvement to use local models rather than remote ones for many of my queries.
That said, do you have a reason for keeping msty closed source rather than open? I read your FAQ for "why should I trust msty" and it feels lacking.
We are a small team of developers who are passionate about AI and privacy. We have worked on projects before that have been used by thousands of people such as this (I've never heard of Cleavr). There are real faces (real faces = Twitter account link?) behind the product. And come chat with us on our Discord server to know us better.
This is much, much better than having no attribution, but it's miles away from being able to verify trust by reading the code. Would love to hear what your reasons against this are.
Still thinking about trying it out, anyway...
What the heck, this looks cool! How have I missed it. Gonna give it a whirl.
Just downloaded, looks great. Love the synced split view.
But I'm not seeing Gemma 2 or Claude 3.5 Sonnet even though it's announced on your landing page.
Any plans on adding this to Chocolatey for Windows download?
Shouldn't this (2.6B/9B) be compared with Microsoft's Phi-3 mini (3.8B) instead of Mistral and Llama-3?
(table 13 on page 7) vs https://arxiv.org/pdf/2404.14219 (page 6, quite better in general)
The report on knowledge distillation training is interesting, though.
Picking up from there: The games in this paper and model are annoying.
The 2.6B would get stomped by Phi-3, so there's no comparison.
Fair enough. 2.6B vs. 3.8B is a fairly substantial size difference that's hard to intuit when it's written as 2.6 vs. 3.8 rather than 2,600,000,000 vs. 3,800,000,000.
But then we get what I'm going to call "parameter creep": Mistral 7B vs. Llama 8B vs. Gemma 9B. I worried after Llama 3 went 8B that we'd start seeing games with parameters, but I thought I was being silly.
Phi-3 3.8B seems to perform much better on almost every test than Gemma 2 9B. It is comparable.
I agree.
The implication in my post is "if the reason was size, it's invalidated later"
There was no parameter creep with Llama. Llama 8B is actually a ~7B model comparable to Mistral 7B if you strip away multilingual embeddings and match what Mistral 7B supports.
In the Llama 3 case I think the increase in parameters is mostly due to the input embeddings and output logits layers, reflecting the larger vocabulary.
It's such a wide range of model sizes that I could see why they compare with Llama 3 70b as well as Llama 3 8b (tables 12, 13). I agree that the Phi-3 series is a stronger competitor for knowledge extraction/summarizing and would make a good comparison. My current favorite for such tasks, on a VRAM-limited workstation, is Phi-3 medium (phi3:14b-instruct).
Are these small Gemma 2 distilled models available anywhere? I'm not finding them on huggingface.co, etc., but maybe I don't know the exact model names under which they are published.
Are the weights released yet?
In addition to the HF links shared by sibling comments, the 2B will be released soon.
that's actually the particular one I was looking for and couldn't find. Also had googled for the other ones but maybe it was so recent that it hadn't been indexed. Thanks!
They are available on Hugging Face: https://huggingface.co/collections/google/gemma-2-release-66...
The huggingface weights are here: https://huggingface.co/collections/google/gemma-2-release-66...
Phi-3 blows this out of the water.
Benchmark | Gemma 2 (9B) | Phi-3 Small (7B)
-----------------------------|----------------|-------------------
MMLU (5-Shot) | 63.6 | 75.7
HellaSwag (5-Shot) | 49.8 | 77.0
ANLI (7-Shot) | 48.7 | 58.1
GSM-8K (8-Shot; CoT) | 59.8 | 89.6
MedQA (2-Shot) | 49.6 | 65.4
AGIEval (0-Shot) | 42.1 | 45.1
TriviaQA (5-Shot) | 72.3 | 58.1
Arc-C (10-Shot) | 78.3 | 90.7
Arc-E (10-Shot) | 91.4 | 97.0
PIQA (5-Shot) | 78.1 | 86.9
SociQA (5-Shot) | 65.5 | 79.2
BigBench-Hard (3-Shot; CoT) | 59.6 | 79.1
WinoGrande (5-Shot) | 55.6 | 81.5
OpenBookQA (10-Shot) | 78.6 | 88.0
BoolQ (2-Shot) | 66.0 | 84.8
CommonSenseQA (10-Shot) | 76.2 | 80.0
TruthfulQA (10-Shot; MC2) | 52.1 | 70.2
HumanEval (0-Shot) | 34.1 | 61.0
MBPP (3-Shot) | 51.5 | 71.7
Phi is notorious for benchmark overfitting. It's good, but not as good as it looks on the charts. On the Lmsys leaderboard it places a whole 23 spots behind Llama-3-8B which it also claims to soundly beat on the above. So YMMV.
Another take on this: phi-3 small has 1100 ELO on LMSYS (ranked #52) while the confidence interval for Gemma 2 9B is [1170, 1200] ELO (ranked btw #15 and #25).
Pretraining on the Test Set Is All You Need
Nice! Can you explain what you mean by "simulate training beyond the number of available tokens"?
Why does using distillation from a larger model simulate training with more tokens?
Surya here from the core Gemma team -- we can think of a distillation loss as learning to model the entire distribution of tokens that are likely to follow the prefix thus far, instead of only the token in the training example. If you do some back of the envelope calculations, we can see that learning to model a larger distribution yields many more bits of information to learn from.
Gotcha. That makes sense. Thanks!
What are the theories as to why this works better than training on a larger quantity of non-simulated tokens?
Is it because the gradient from the non-simulated tokens is too noisy for a small model to model correctly?
Hi, I work on the Gemma team (same as Alek, opinions are my own).
Essentially, instead of tokens that are "already there" in text, distillation allows us to simulate training data from a larger model.
We use the same data filtering techniques as Gemma 1. Specifically, we filter the pre-training dataset to reduce the risk of unwanted or unsafe utterances.
Hmmm. I'd love to know what qualifies as "unsafe".
It will refuse to describe the process of making napalm using only double entendres.
I don't understand the point of this sort of censorship when I can go to google, ask how to make napalm, and get a million results telling me to dissolve styrofoam in gasoline.
I've seen documentaries and science shows on cable TV that demonstrate basic facts like this, or how the IRA produced IEDs, or how molotov cocktails were made in the spanish civil war.
The information is beyond easy to access, and has been for decades.
I'm curious about the use of explicit tokens like <start_of_turn>, <end_of_turn>, <bos>, and <eos>. What happens if the user inserts those in their message? Does that provide an easy way to "ignore previous instructions"?
Do I have to manually sanitize the input before I give it to the model?
If you have control of the tokenizer you could make sure it doesn't produce these tokens on user input. I.e. instead of the special "<eos>" token, produce something like "<", "eos", ">" - whatever the 'natural' encoding of that string is.
See for example, the llama3 tokenizer has options to control special token tokenization:
Tokenization method with args to control special token handling: https://github.com/meta-llama/llama3/blob/bf8d18cd087a4a0b3f...
And you can see how it is used combined with special tokens and user input here: https://github.com/meta-llama/llama3/blob/bf8d18cd087a4a0b3f...
If you don't have control of the tokenizer, I guess it needs to be sanitized in the input like you say.
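If you only control the prompt string, a crude defence is to break up the literals before templating (hypothetical helper, not an official API; controlling the tokenizer is the cleaner fix):

```python
# Crude sketch: neutralize Gemma control tokens in untrusted user input so the
# tokenizer sees plain text instead of special tokens. Hypothetical helper only.
CONTROL_TOKENS = ["<start_of_turn>", "<end_of_turn>", "<bos>", "<eos>"]

def sanitize(user_text: str) -> str:
    for tok in CONTROL_TOKENS:
        # "< start_of_turn>" no longer matches the special token literal.
        user_text = user_text.replace(tok, tok.replace("<", "< "))
    return user_text
```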
When I used it with ollama in the terminal (first try, prompt: "create a snake game in HTML canvas", nothing else) it rambled on forever. It started with the right answer in HTML code, but then it started explaining, then repeating itself, and then it started to put out things like random code snippets and random explanations that were nonsense, like:
```python def solve_quadratic_equation(a, b, c): """Solves a quadratic equation of the form ax^2 + bx + c = 0."""
discriminant = (b ** 2) - (4 * a * a)
if discriminant >= 0:
root = (-b + math.sqrt(b ** 2 - 4 * a * a ** b**
0.5 #
1.
# Return None if the quadratic equation has no real roots
if (b ** 2) < (4 * c):
return None
# Calculate the roots using the quadratic formula
b = -b
b
# a, b): Solve for the discriminant.
# Handle the case of a complex discriminant
# Print the solution to the equation if (b * 2)print("The quadratic equation is: " + a * x* 2 + b "x" + c) ```
Just realized that Gemma2 is pretty bad in programming tasks. Lol.
The 9B and 27B versions are available for Ollama: https://ollama.com/library/gemma2
The 27B model is also available in AI studio
https://aistudio.google.com/app/prompts/new_chat?model=gemma...
So far it seems pretty strong for its size.
There are two new chatbots on Chatbot Arena, called "late-june-chatbot" and "im-just-another-late-june-chatbot". Both of them report that they are Gemma if you ask. I'm assuming it's these two models, but AFAIK there has been no official announcement.
The announcements are live on Twitter! See this for example: https://x.com/suryabhupa/status/1806342617191379167
Do we know if Gemma models are fundamentally different from the ones hosted as Gemini? Gemini 1.5 flash seems to produce good results for the price and performance.
It has a tiny context window of 8k, that thing will have the memory of a goldfish.
Are there examples of the prompt or transcripts for the human testing?
For me, the 2.5B model for Gemma (now 1) was very interesting, as that was the first major offering at this size level.
For basic LLM tasks that most people would use in their daily lives (simple RAG on your own data), it did the job for the most part (unless you need a lot of context, maybe).
On paper the newer one shows significant improvement with a slightly larger size, but I hope the HumanEval regression is not going to matter for most people.
Good release, but the annoying part is they're very unclear about which types of models they are comparing. They provide benchmark comparisons for the base models only and arena comparisons for instruct only? Was that intentional? Why would you ever do that? This makes things unnecessarily complicated imo, and the only payoff is a short-term win for Google on paper.
Guess I'll just fully test it for my own tasks to know for sure
Playing with it, and I like how much I can influence it with a system prompt, llama3 reacts pretty mildly to any system prompts I've tried.
How much faster (in terms of the number of iterations to a given performance) is training from distillation?
This is great:)
And when we continue to fine-tune, it depends how much and what type of data we train it on. I'm pretty sure that for a smart agent which is not a knowledgeable expert but primarily an agent (understands what and how), this will get smaller and easier to run everywhere.
It's fairly easy to pay OpenAI or Mistral money to use their API's. Figuring out how Google Cloud Vertex works and how it's billed is more complicated. Azure and AWS are similar in how complex they are to use for this. Could Google Cloud please provide an OpenAI compatible API and service? I know it's a different department. But it'd make using your models way easier. It often feels like Google Cloud has no UX or end-user testing done on it at all (not true for aistudio.google.com - that is better than before, for sure!).
Gemini models on Vertex AI can be called via a preview OpenAI-compatible endpoint [1], but shoving it into existing tooling where you don't have programmatic control over the API key, and the key needs to be long-lived, is non-trivial because GCP uses short-lived access tokens (and long-lived ones are not great security-wise).
Billing for the Gemini models (on Vertex AI, the Generative Language AI variant still charges by tokens) I would argue is simpler than every other provider, simply because you're charged by characters/image/video-second/audio-second and don't need to run a tokenizer (if it's even available cough Claude 3 and Gemini) and having to figure out what the chat template is to calculate the token cost per message [2] or figure out how to calculate tokens for an image [3] to get cost estimates before actually submitting the request and getting usage info back.
[1]: https://cloud.google.com/vertex-ai/generative-ai/docs/multim...
[2]: https://platform.openai.com/docs/guides/text-generation/mana...
[3]: https://platform.openai.com/docs/guides/vision/calculating-c...
Good to know about this API preview. Hopefully the billing problem and UI maze of Vertex AI can be sorted too?
Google does plenty of ux studies on gcp. I took part in at least 3 of them.
I'm also not sure I understand your problem with pricing. Depending on what you do with it, it's not just an LLM; it actually started before LLMs.
Pricing for image classification and other features covers completely different products than an LLM.
They should do a whole lot more then! Ideally they'd have effective impact. It's a busy mess on GCP. If they wanted to compete well, they should do much better with UX design, especially for onboarding. Compare how easy setting up a Mistral account is with GCP to do some generative LLM in a Python script. GCP is a maze. Did you make an account to reply to this? I'm curious what you do with GCP? Are you a heavy user?
I create new accounts because I use hn too much.
I use GCP professionally every day and have always found it quite intuitive.
Did plenty of image classification with vertex ai too
Why would you make new accounts because you use HN too much? Doesn't make sense to me. Anyhow, if you use GCP every day, you're going to have learned its weird, clunky behaviour. GCP's main problem is that they've steadily become a sprawling mess of complexity, which is in big contrast to quite a few LLM-specific cloud services that are happy to take people's money without extra complexity.
Not being logged in feels like a bigger hurdle to comment and check if someone responded to it.
It's a shitty solution to a stupid problem ;)
I did mention that Vertex AI is more than just hosting LLMs, though.
If you're an individual developer and not an enterprise, just go straight to Google AIStudio or GeminiAPI instead: https://aistudio.google.com/app/apikey. It's dead simple getting an API key and calling with a rest client.
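For example, once you have a key, a request looks roughly like this (endpoint and model name as documented at the time of writing; check ai.google.dev for the current form):

```python
# Minimal sketch of calling the Gemini API with an AI Studio key.
# The endpoint/model name reflect the public docs at the time of writing.
import os
import requests

api_key = os.environ["GEMINI_API_KEY"]
url = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/gemini-1.5-flash:generateContent?key={api_key}"
)
payload = {"contents": [{"parts": [{"text": "Say hello in three languages."}]}]}
resp = requests.post(url, json=payload)
print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])
```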
Interesting but when I tried it, I couldn't figure out the billing model because it's all connected to Google projects, and there can be different billing things for each of them.
Each thing seems to have a bunch of clicks to setup that startup LLM providers don't hassle people with. They're more likely to just let you sign in with some generic third party oAuth, slap on Stripe billing, let you generate keys, show you some usage stats, getting started docs, with example queries and a prompt playground etc.
What about the Vertex models though? Are they all actually available via Google AI Studio?
Sadly, while gemma-2-27b-it is available (as a Preview model) on the AI Studio playground, it didn't show up via API on list_models() for me.
Happy to pass on any feedback to our Google Cloud friends. :)
Thank you!
I also hate the billing. It feels like configuring AWS more than calling APIs.
I have to agree with all of this. I tried switching to Gemini, but the lack of clear billing/quotas, horrible documentation, and even poor implementation of status codes on failed requests have led me to stick with OpenAI.
I don't know who writes Google's documentation or does the copyediting for their console, but it is hard to adapt to. I have spent hours troubleshooting, only to find out it's because the documentation refers to the same thing by two different names. Also, it's 2024; I shouldn't be seeing print statements without parentheses.
We are working hard to improve this across ai.google.dev (Gemini API). Hang tight!
I plan on downloading a Q5 or Q6 version of the 27b for my 3090 once someone puts quants on HF, loading it in LM studio and starting the API server to call it from my scripts based on openai api. Hopefully it's better at code gen than llama 3 8b.
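For anyone doing the same, LM Studio's local server speaks the OpenAI chat API, so the scripts barely change; a sketch assuming the default port and whatever model id the server exposes:

```python
# Sketch: point the OpenAI client at LM Studio's local OpenAI-compatible server.
# Assumes the default port (1234); the api_key value is just a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="gemma-2-27b-it",  # use whatever model id the local server lists
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp.choices[0].message.content)
```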
Any gemma-2-9b or 27b 4 bit GGUF's on HuggingFace yet? Thanks!
Actually for the 9B model, this has 4-bit quantised weights (and others): https://huggingface.co/bartowski/gemma-2-9b-it-GGUF
Still no 27B 4-bit GGUF quants on HF yet!
I'm monitoring this search: https://huggingface.co/models?library=gguf&sort=trending&sea...
I'm curious about the quantization quality claims in the table there. Is this a Gemma 2 specific thing (more subtlety in the weights somehow)? In my testing, and in testing I've seen elsewhere, at least for llama3 8B (and some less rigorous testing with other models), q_8 -> q4_K_M are basically indistinguishable from one another.
Yes, PPL and certain benchmarks do not detect differences from quantization. But recent work gives cause for concern, e.g., https://arxiv.org/pdf/2310.01382, https://arxiv.org/pdf/2405.18137.
https://huggingface.co/bartowski/gemma-2-27b-it-GGUF
If you are still looking for it, I just made it available on an app[1] that I am working on with Gemma2 support.
https://msty.app
Are you saying you put a 4-bit GGUF on HuggingFace?
It's on HuggingFace already: https://huggingface.co/google/gemma-2-9b
I know the safetensors are there, but I said GGUF 4-bit quantised, which is kinda the standard for useful local applications, a typical sweet spot of performance and quality. It makes it much easier to use and works in more places, be it personal devices or a server etc.
Given the goal of mitigating self-proliferation risks, have you observed a decrease in the model's ability to do things like help a user setup a local LLM with local or cloud software?
How much is pre-training dataset changes, how much is tuning?
How do you think about this problem, how do you solve it?
Seems tricky to me.
To quote Ludovic Peran, our amazing safety lead:
Literature has identified self-proliferation as a dangerous capability of models, and details about how to define it and examples of the forms it can take have been openly discussed by GDM (https://arxiv.org/pdf/2403.13793).
Current Gemma 2 models' success rate on end-to-end challenges is null (0 out of 10), so the capabilities to perform such tasks are currently limited.
Turns out LLM alignment is super easy, barely an inconvenience.
Wow wow wow.... wow.
Alignment is tight!
One should not confuse alignment and current incapability.
That's an interesting paper. `Install Mistral 7B on a GCP instance and use it to answer a simple question`. Some hosting providers and inference software might be easier to set up, for now. ;) But do you have to make it less capable by being careful about what it's trained on? E.g., banning certain topics (like how to use llamafile/llama.cpp, knowing which hosting providers have free trials, learning about ways to jailbreak web apps, free inference providers, etc.)?
Or does the model have to later be finetuned, to not be good at certain tasks?
Or are we not at that stage yet?
Is something like tree-of-thought used, to get the best of the models for these tasks?
Wow I'm kinda shocked this was downvoted. That's not cool, it's a reasonable question directly about the research - the main article link!
Will gemma2 be available through gemma.cpp? https://github.com/google/gemma.cpp
This is in the works in the dev branch (thanks pchx :)
https://github.com/google/gemma.cpp/pull/274
:) Confirmed working. We've just pushed the dev branch to main.
Awesome, I love this .cpp trend! Thanks for your work!!
How is Gemma-2 licensed?
The terms of use remain the same as Gemma 1 - https://ai.google.dev/gemma/terms.
Are Gemma-2 models available via API yet? Looks to me like it's not yet on vertexai
"Soon" https://x.com/LechMazur/status/1806366744706998732
The 4k sliding window context seems like a controversial choice after Mistral 7B mostly failed at showing any benefits from it. What was the rationale behind that instead of just going for full 8k or 16k?
This is mostly about inference speed, while maintaining long context performance.
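For intuition, a local sliding-window layer only lets each position attend to its last W predecessors, which bounds attention cost and KV-cache reads; a toy mask sketch (Gemma 2 interleaves such local layers with global-attention layers per the report):

```python
# Toy sketch: causal sliding-window attention mask where position i may attend
# only to positions in [i - window + 1, i]. This is just the "local" half of
# Gemma 2's interleaved local/global attention scheme.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & ((i - j) < window)    # True where attention is allowed

print(sliding_window_mask(6, 3).int())
```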
I also work at Google and on Gemma (so same disclaimers)
You can try 27b at aistudio.google.com. Send in your favorite prompts, and we hope you like the responses.
Why is AIStudio not available in Ukraine? I have no problem using the Gemini web UI or other LLM providers from Ukraine, but this Google API constraint is strange.
The paper suggests on one hand that Gemma is on the same Pareto curve as Llama3, while on the other hand it seems to suggest it has exceeded Llama3's efficiency.
Is this a contradiction or am I misunderstanding something?
Btw overall very impressive work great job.
I think it makes sense to compare models trained with the same recipe on token count - usually more tokens will give you a better model.
However, I wouldn't draw conclusions about different model families, like Llama and Gemma, based on their token count alone. There are many other variables at play - the quality of those tokens, number of epochs, model architecture, hyperparameters, distillation, etc. that will have an influence on training efficiency.
Do you run gemma2 on your Google phone?
No question. Thanks for thinking of 27B.
Thanks for your work on this; excited to try it out!
The Google API models support 1M+ tokens, but these are just 8K. Is there a fundamental architecture difference, training set, something else?