Hi, one of the authors, Austin, here. Happy to answer any questions as best I can.
To get a few common questions out of the way:
- This is separate from / independent of llama.cpp / ggml. I'm a big fan of that project and it was an inspiration (we say as much in the README). I've been a big advocate of gguf + llama.cpp support for gemma and am happy for people to use that.
- How is it different from inference runtime X? gemma.cpp is a direct implementation of gemma. In its current form it's aimed at experimentation + research, portability, and easy modifiability, rather than being a general-purpose deployment framework.
- This initial implementation is cpu simd centric. We're exploring options for portable gpu support, but the cool thing is it will build and run in a lot of environments you might not expect an llm to run in, so long as you have the memory to load the model.
- I'll let other colleagues answer questions about the Gemma model itself; this is a C++ implementation of the model, relatively independent of the model training process.
- Although this is from Google, we're a very small team that wanted such a codebase to exist. We have lots of plans to use it ourselves and we hope other people like it and find it useful.
- I wrote a twitter thread on this project here: https://twitter.com/austinvhuang/status/1760375890448429459
Cool, any plans on adding K quants, an API server and/or a python wrapper? I really doubt most people want to use it as a cpp dependency and run models at FP16.
There's a custom 8-bit quantization (SFP); it's what we recommend. At 16 bits, we do bfloat16 instead of fp16 thanks to https://github.com/google/highway, even on CPU. Other quants - stay tuned.
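For the curious, bfloat16 is just the top 16 bits of an fp32, which is why it's cheap to handle on CPU. A minimal illustration in plain C++ (a truncating conversion for clarity; this is not the highway/gemma.cpp code, which also handles rounding and SIMD):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // bfloat16 keeps the sign, 8 exponent bits, and the top 7 mantissa bits
    // of an IEEE-754 float32, so conversion is basically a 16-bit shift.
    static uint16_t F32ToBF16(float f) {
      uint32_t bits;
      std::memcpy(&bits, &f, sizeof(bits));
      return static_cast<uint16_t>(bits >> 16);  // truncate low mantissa bits
    }

    static float BF16ToF32(uint16_t h) {
      uint32_t bits = static_cast<uint32_t>(h) << 16;
      float f;
      std::memcpy(&f, &bits, sizeof(f));
      return f;
    }

    int main() {
      float x = 3.14159f;
      std::printf("%f -> %f after a bf16 round trip\n", x, BF16ToF32(F32ToBF16(x)));
      return 0;
    }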
python wrapper - if you want to run the model in python, I feel like there are already a lot of more mature options available (see the model variations at https://www.kaggle.com/models/google/gemma), but if people really want this and have something they want to do with a python wrapper that can't be done with existing options, let me know. (Similar thoughts wrt API servers.)
In my experience there's really no reason to run any model above Q6_K; the performance is identical and you shave off almost 2 GB of VRAM on a 7B model compared to Q8. To those of us with single-digit gigabytes of VRAM, that's highly significant. But most people seem to go for 4 bits anyway, and it's the AWQ standard too. If you think it'll make the model look bad, then don't worry, it's only the relative performance that matters.
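Rough arithmetic behind that "almost 2 GB", using approximate bits-per-weight figures for llama.cpp's formats (my numbers, so treat them as ballpark):

    #include <cstdio>

    int main() {
      const double params  = 7e9;      // 7B weights
      const double q6k_bpw = 6.5625;   // approx. bits/weight for Q6_K
      const double q8_bpw  = 8.5;      // approx. bits/weight for Q8_0
      const double q6k_gb = params * q6k_bpw / 8 / 1e9;  // ~5.7 GB
      const double q8_gb  = params * q8_bpw  / 8 / 1e9;  // ~7.4 GB
      std::printf("Q6_K ~%.1f GB, Q8_0 ~%.1f GB, saving ~%.1f GB\n",
                  q6k_gb, q8_gb, q8_gb - q6k_gb);
      return 0;
    }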
I would think that an OpenAI-compatible API would be a higher priority than a python wrapper, since then it could act as a drop-in replacement for almost any backend.
A nice side effect of a CPU SIMD implementation is that you just need enough regular RAM, which tends to be far less scarce than VRAM. Nonetheless, I get your point that more aggressive quantization is valuable + will share with the modeling team.
True, it's the only way I can, for example, run Mixtral on an 8 GB GPU, but main memory will always have more latency, so some tradeoff tends to be worth it. And parts like the prompt batch buffer and most of the context generally have to be in VRAM if you want to use cuBLAS; with OpenBLAS it's maybe less of a problem, but it is slower.
Hi Austin, what say you about how the Gemma rollout was handled, issues raised, and atmosphere around the office? :)
I'm not Austin, but I am Tris, the friendly neighborhood product person on Gemma. Overall, I think that the main feeling is: incredibly relieved to have had the launch go as smoothly as it has! The complexity of the launch is truly astounding:
1) Reference implementations in JAX, PyTorch, TF with Keras 3, MaxText/JAX, more...
2) Full integration at launch with HF including Transformers + optimization therein
3) TensorRT-LLM and full NVIDIA opt across the stack in partnership with that team (mentioned on the NVIDIA earnings call by Jensen, even)
4) More developer surfaces than you can shake a stick at: Kaggle, Colab, Gemma.cpp, GGUF
5) Comms landing with full coordination from Sundar + Demis + Jeff Dean, not to mention positive articles in NYT, Verge, Fortune, etc.
6) Full Google Cloud launches across several major products, including Vertex and GKE
7) Launched globally and with a permissive set of terms that enable developers to do awesome stuff
Pulling that off without any major SNAFUs is a huge relief for the team. We're excited by the potential of using all of those surfaces and the launch momentum to build a lot more great things for you all =)
I am not a fan of a lot of what Google does, but congratulations! That’s a massive undertaking and it is bringing the field forward. I am glad you could do this, and hope you’ll have many other successful releases.
Now, I’m off playing with a new toy :)
Thanks for releasing this! What is your use case for this rather than llama.cpp? For the on-device AI stuff I mostly do, llama.cpp is better because of GPU/metal offloading.
llama.cpp is great; if it fits your needs, you can use it. I think at this point llama.cpp is effectively a platform that's hardened for production.
In its current form, I think of gemma.cpp as more of a direct model implementation (somewhere between the minimalism of llama2.c and the generality of ggml).
I tend to think of 3 modes of usage:
- hacking on inference internals - there's very little indirection, no IRs, the model is just code, so if you want to add your own runtime support for sparsity/quantization/model compression/etc. and demo it working with gemma, there are minimal barriers to doing so
- implementing experimental frontends - i'll add some examples of this in the very near future (there's a toy sketch after this list), but you're free to get pretty creative with terminal UIs, code that interacts with model internals like the KV cache, accepting/rejecting tokens, etc.
- interacting with the model locally with a small program - of course there are other options for this, but hopefully this is one way to play with gemma w/ minimal fuss.
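To make the second mode concrete, here's a toy sketch of the kind of decode loop an experimental frontend boils down to. FakeModel is a made-up stand-in, not the actual gemma.cpp API; the real code differs, but the shape of the loop (generate a token, let the frontend accept or reject it, stream text out) is similar:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Made-up stand-in for a model object; the real gemma.cpp API differs.
    struct FakeModel {
      int NextToken(const std::vector<int>& ctx) {
        return static_cast<int>(ctx.size() % 7);  // pretend greedy decoding
      }
      std::string Detokenize(int tok) { return "<" + std::to_string(tok) + ">"; }
    };

    int main() {
      FakeModel model;
      std::vector<int> ctx = {1, 2, 3};  // pretend these are prompt tokens
      for (int i = 0; i < 8; ++i) {
        const int tok = model.NextToken(ctx);
        if (tok == 0) break;                 // a frontend can reject/stop here
        ctx.push_back(tok);                  // context (and KV cache) grows
        std::printf("%s", model.Detokenize(tok).c_str());  // stream to a UI
      }
      std::printf("\n");
      return 0;
    }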
That sounds interesting
So... llamafile release?
https://github.com/Mozilla-Ocho/llamafile
gguf files are out there, so anyone should be able to do this! are people looking for an "official" version?
ps i'm a fan of cosmopolitan as well.
Cosmopolitan is a fan of you :-) great work on gemma.cpp. I'm really impressed with it so far.
What's the reason not to integrate with llama.cpp instead of building a separate app? In what ways is this better than llama.cpp?
On uses, see https://news.ycombinator.com/item?id=39481554#39482302 and on llama.cpp support - https://news.ycombinator.com/item?id=39481554
Gemma support has been added to llama.cpp, and we're more than happy to see people use it there.
I think on uses you meant to link to https://news.ycombinator.com/item?id=39482581, a child of https://news.ycombinator.com/item?id=39481554#39482302?
side note: imagine how gnarly those urls would be if HN used UUIDs instead of integers for IDs :-D
This is really cool, Austin. Kudos to your team!
Thanks so much!
Everyone working on this self-selected into contributing, so I think of it less as my team than ... a team?
Specifically want to call out: Jan Wassenberg (author of https://github.com/google/highway) and I started gemma.cpp as a small project just a few months ago + Phil Culliton, Dan Zheng, and Paul Chang + of course the GDM Gemma team.
Huge +1, this has definitely been a self-forming collective of people who love great AI, great research, and the open community.
Austin and Jan are truly amazing. The optimization work is genuinely outstanding; I get incredible CPU performance on Gemma.cpp for inference. Thanks for all of the awesomeness, Austin =)
Kudos on your release! I know this was just made available, but:
- Somewhere in the README, consider adding the need for a `-DWEIGHT_TYPE=hwy::bfloat16_t` flag for non-sfp builds. Maybe around step 3.
- The README should explicitly say somewhere that there's no GPU support (at the moment).
- "Failed to read cache gating_ein_0 (error 294)" is pretty obscure. I think even "(error at line number 294)" would be a big improvement when it fails to FindKey.
- There's something odd about the 2b vs 7b model. The 2b will claim it's trained by Google but the 7b won't. Were these trained on the same data?
- Are the .sbs weights the same weights as the GGUF? I'm getting different answers compared to llama.cpp. Do you know of a good way to compare the two? Any way to make both deterministic? Or even dump probability distributions on the first (or any) token to compare?
Yes - thanks for pointing that out. The README is being updated; you can see a WIP version in the dev branch: https://github.com/google/gemma.cpp/tree/dev?tab=readme-ov-f... Improving error messages is a high priority.
The weights should be the same across formats, but it's easy for differences to arise due to quantization and/or subtle implementation differences. Minor implementation differences have been a pain point in the ML ecosystem for a while (w/ IRs, onnx, python vs. runtime, etc.), but hopefully the differences aren't too significant (if they are, it's a bug in one of the implementations).
There were quantization fixes like https://twitter.com/ggerganov/status/1760418864418934922 and other patches happening, but it may take a few days for patches to work their way through the ecosystem.
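On the question of comparing the two implementations: one rough approach (assuming you can dump the first-token logits from each runtime into plain float arrays, which neither tool gives you out of the box) is to compare the resulting softmax distributions, e.g. by max absolute difference and KL divergence:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Softmax over raw logits (subtract the max for numerical stability).
    std::vector<double> Softmax(const std::vector<double>& logits) {
      const double mx = *std::max_element(logits.begin(), logits.end());
      std::vector<double> p(logits.size());
      double sum = 0.0;
      for (std::size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] - mx);
        sum += p[i];
      }
      for (double& v : p) v /= sum;
      return p;
    }

    // Compare two same-length logit vectors from different implementations.
    void Compare(const std::vector<double>& a_logits,
                 const std::vector<double>& b_logits) {
      const std::vector<double> a = Softmax(a_logits), b = Softmax(b_logits);
      double max_diff = 0.0, kl = 0.0;
      for (std::size_t i = 0; i < a.size(); ++i) {
        max_diff = std::max(max_diff, std::fabs(a[i] - b[i]));
        if (a[i] > 0.0 && b[i] > 0.0) kl += a[i] * std::log(a[i] / b[i]);
      }
      std::printf("max |p_a - p_b| = %g   KL(a||b) = %g\n", max_diff, kl);
    }

    int main() {
      // Tiny fake example; in practice these would be vocab-sized dumps.
      Compare({1.0, 2.0, 3.0}, {1.1, 2.0, 2.9});
      return 0;
    }

For determinism, greedy decoding (temperature 0) on both sides at least keeps sampling noise out of the comparison.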