Hi, one of the authors, Austin, here. Happy to answer any questions as best I can.
To get a few common questions out of the way:
- This is separate from / independent of llama.cpp / ggml. I'm a big fan of that project and it was an inspiration (we say as much in the README). I've been a big advocate of gguf + llama.cpp support for gemma and am happy for people to use that.
- How is it different from inference runtime X? gemma.cpp is a direct implementation of gemma. In its current form it's aimed at experimentation + research, portability, and easy modifiability, rather than being a general-purpose deployment framework.
- This initial implementation is cpu simd centric. We're exploring options for portable gpu support, but the cool thing is it will build and run in a lot of environments you might not expect an llm to run in, so long as you have the memory to load the model.
- I'll let other colleagues answer questions about the Gemma model itself; this is a C++ implementation of the model, relatively independent of the model training process.
- Although this is from Google, we're a very small team that wanted such a codebase to exist. We have lots of plans to use it ourselves and we hope other people like it and find it useful.
- I wrote a twitter thread on this project here: https://twitter.com/austinvhuang/status/1760375890448429459
Cool, any plans on adding K quants, an API server and/or a python wrapper? I really doubt most people want to use it as a cpp dependency and run models at FP16.
There's a custom 8-bit quantization (SFP); it's what we recommend. At 16 bits, we do bfloat16 instead of fp16 thanks to https://github.com/google/highway, even on CPU. Other quants - stay tuned.
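For the curious, bfloat16 is just the top 16 bits of an fp32, which is why it's cheap to handle on CPU. A minimal illustration in plain C++ (a truncating conversion for clarity; this is not the highway/gemma.cpp code, which also handles rounding and SIMD):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // bfloat16 keeps the sign, 8 exponent bits, and the top 7 mantissa bits
    // of an IEEE-754 float32, so conversion is basically a 16-bit shift.
    static uint16_t F32ToBF16(float f) {
      uint32_t bits;
      std::memcpy(&bits, &f, sizeof(bits));
      return static_cast<uint16_t>(bits >> 16);  // truncate low mantissa bits
    }

    static float BF16ToF32(uint16_t h) {
      uint32_t bits = static_cast<uint32_t>(h) << 16;
      float f;
      std::memcpy(&f, &bits, sizeof(f));
      return f;
    }

    int main() {
      float x = 3.14159f;
      std::printf("%f -> %f after a bf16 round trip\n", x, BF16ToF32(F32ToBF16(x)));
      return 0;
    }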
python wrapper - if you want to run the model in python, I feel like there are already a lot of more mature options available (see the model variations at https://www.kaggle.com/models/google/gemma), but if people really want this and have something they want to do with a python wrapper that can't be done with existing options, let me know. (Similar thoughts wrt API servers.)
In my experience there's really no reason to run any model above Q6_K; the performance is identical and you shave off almost 2 GB of VRAM on a 7B model compared to Q8. To those of us with single-digit gigabytes of VRAM, that's highly significant. But most people seem to go for 4 bits anyway, and it's the AWQ standard too. If you think it'll make the model look bad, then don't worry, it's only the relative performance that matters.
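Rough arithmetic behind that "almost 2 GB", using approximate bits-per-weight figures for llama.cpp's formats (my numbers, so treat them as ballpark):

    #include <cstdio>

    int main() {
      const double params  = 7e9;      // 7B weights
      const double q6k_bpw = 6.5625;   // approx. bits/weight for Q6_K
      const double q8_bpw  = 8.5;      // approx. bits/weight for Q8_0
      const double q6k_gb = params * q6k_bpw / 8 / 1e9;  // ~5.7 GB
      const double q8_gb  = params * q8_bpw  / 8 / 1e9;  // ~7.4 GB
      std::printf("Q6_K ~%.1f GB, Q8_0 ~%.1f GB, saving ~%.1f GB\n",
                  q6k_gb, q8_gb, q8_gb - q6k_gb);
      return 0;
    }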
I would think that an OpenAI-compatible API would be a higher priority than a python wrapper, since then it could act as a drop-in replacement for almost any backend.
A nice side effect of a CPU SIMD implementation is that you just need enough regular RAM, which tends to be far less scarce than VRAM. Nonetheless, I get your point that more aggressive quantization is valuable + will share with the modeling team.
True, it's the only way I can, for example, run Mixtral on an 8 GB GPU, but main memory will always have more latency, so some tradeoff tends to be worth it. And parts like the prompt batch buffer and most of the context generally have to be in VRAM if you want to use cuBLAS; with OpenBLAS it's maybe less of a problem, but it is slower.
Hi Austin, what say you about how the Gemma rollout was handled, issues raised, and atmosphere around the office? :)
I'm not Austin, but I am Tris, the friendly neighborhood product person on Gemma. Overall, I think that the main feeling is: incredibly relieved to have had the launch go as smoothly as it has! The complexity of the launch is truly astounding:
1) Reference implementations in JAX, PyTorch, TF with Keras 3, MaxText/JAX, more...
2) Full integration at launch with HF including Transformers + optimization therein
3) TensorRT-LLM and full NVIDIA opt across the stack in partnership with that team (mentioned on the NVIDIA earnings call by Jensen, even)
4) More developer surfaces than you can shake a stick at: Kaggle, Colab, Gemma.cpp, GGUF
5) Comms landing with full coordination from Sundar + Demis + Jeff Dean, not to mention positive articles in NYT, Verge, Fortune, etc.
6) Full Google Cloud launches across several major products, including Vertex and GKE
7) Launched globally and with a permissive set of terms that enable developers to do awesome stuff
Pulling that off without any major SNAFUs is a huge relief for the team. We're excited by the potential of using all of those surfaces and the launch momentum to build a lot more great things for you all =)
I am not a fan of a lot of what Google does, but congratulations! That’s a massive undertaking and it is bringing the field forward. I am glad you could do this, and hope you’ll have many other successful releases.
Now, I’m off playing with a new toy :)
Thanks for releasing this! What is your use case for this rather than llama.cpp? For the on-device AI stuff I mostly do, llama.cpp is better because of GPU/metal offloading.
llama.cpp is great; if it fits your needs, you can use it. I think at this point llama.cpp is effectively a platform that's hardened for production.
In its current form, I think of gemma.cpp as more of a direct model implementation (somewhere between the minimalism of llama2.c and the generality of ggml).
I tend to think of 3 modes of usage:
- hacking on inference internals - there's very little indirection, no IRs, the model is just code, so if you want to add your own runtime support for sparsity/quantization/model compression/etc. and demo it working with gemma, there are minimal barriers to doing so
- implementing experimental frontends - i'll add some examples of this in the very near future (there's a toy sketch after this list), but you're free to get pretty creative with terminal UIs, code that interacts with model internals like the KV cache, accepting/rejecting tokens, etc.
- interacting with the model locally with a small program - of course there are other options for this, but hopefully this is one way to play with gemma w/ minimal fuss.
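To make the second mode concrete, here's a toy sketch of the kind of decode loop an experimental frontend boils down to. FakeModel is a made-up stand-in, not the actual gemma.cpp API; the real code differs, but the shape of the loop (generate a token, let the frontend accept or reject it, stream text out) is similar:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Made-up stand-in for a model object; the real gemma.cpp API differs.
    struct FakeModel {
      int NextToken(const std::vector<int>& ctx) {
        return static_cast<int>(ctx.size() % 7);  // pretend greedy decoding
      }
      std::string Detokenize(int tok) { return "<" + std::to_string(tok) + ">"; }
    };

    int main() {
      FakeModel model;
      std::vector<int> ctx = {1, 2, 3};  // pretend these are prompt tokens
      for (int i = 0; i < 8; ++i) {
        const int tok = model.NextToken(ctx);
        if (tok == 0) break;                 // a frontend can reject/stop here
        ctx.push_back(tok);                  // context (and KV cache) grows
        std::printf("%s", model.Detokenize(tok).c_str());  // stream to a UI
      }
      std::printf("\n");
      return 0;
    }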
That sounds interesting
So... llamafile release?
https://github.com/Mozilla-Ocho/llamafile
gguf files are out there, so anyone should be able to do this! are people looking for an "official" version?
ps i'm a fan of cosmopolitan as well.
Cosmopolitan is a fan of you :-) great work on gemma.cpp. I'm really impressed with it so far.
What's the reason not to integrate with llama.cpp instead of building a separate app? In what ways is this better than llama.cpp?
On uses, see https://news.ycombinator.com/item?id=39481554#39482302 and on llama.cpp support - https://news.ycombinator.com/item?id=39481554
Gemma support has been added to llama.cpp, and we're more than happy to see people use it there.
I think on uses you meant to link to https://news.ycombinator.com/item?id=39482581, a child of https://news.ycombinator.com/item?id=39481554#39482302?
side note: imagine how gnarly those urls would be if HN used UUIDs instead of integers for IDs :-D
This is really cool, Austin. Kudos to your team!
Thanks so much!
Everyone working on this self-selected into contributing, so I think of it less as my team than ... a team?
Specifically want to call out: Jan Wassenberg (author of https://github.com/google/highway) and I started gemma.cpp as a small project just a few months ago + Phil Culliton, Dan Zheng, and Paul Chang + of course the GDM Gemma team.
Huge +1, this has definitely been a self-forming collective of people who love great AI, great research, and the open community.
Austin and Jan are truly amazing. The optimization work is genuinely outstanding; I get incredible CPU performance on Gemma.cpp for inference. Thanks for all of the awesomeness, Austin =)
Kudos on your release! I know this was just made available, but:
- Somewhere in the README, consider adding the need for a `-DWEIGHT_TYPE=hwy::bfloat16_t` flag for non-sfp builds. Maybe around step 3.
- The README should explicitly say somewhere that there's no GPU support (at the moment).
- "Failed to read cache gating_ein_0 (error 294)" is pretty obscure. I think even "(error at line number 294)" would be a big improvement when it fails to FindKey.
- There's something odd about the 2b vs 7b model. The 2b will claim it's trained by Google but the 7b won't. Were these trained on the same data?
- Are the .sbs weights the same weights as the GGUF? I'm getting different answers compared to llama.cpp. Do you know of a good way to compare the two? Any way to make both deterministic? Or even dump probability distributions on the first (or any) token to compare?
Yes - thanks for pointing that out. The README is being updated; you can see a WIP version in the dev branch: https://github.com/google/gemma.cpp/tree/dev?tab=readme-ov-f... Improving error messages is a high priority.
The weights should be the same across formats, but it's easy for differences to arise due to quantization and/or subtle implementation differences. Minor implementation differences have been a pain point in the ML ecosystem for a while (w/ IRs, onnx, python vs. runtime, etc.), but hopefully the differences aren't too significant (if they are, it's a bug in one of the implementations).
There were quantization fixes like https://twitter.com/ggerganov/status/1760418864418934922 and other patches happening, but it may take a few days for patches to work their way through the ecosystem.
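On the question of comparing the two implementations: one rough approach (assuming you can dump the first-token logits from each runtime into plain float arrays, which neither tool gives you out of the box) is to compare the resulting softmax distributions, e.g. by max absolute difference and KL divergence:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Softmax over raw logits (subtract the max for numerical stability).
    std::vector<double> Softmax(const std::vector<double>& logits) {
      const double mx = *std::max_element(logits.begin(), logits.end());
      std::vector<double> p(logits.size());
      double sum = 0.0;
      for (std::size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] - mx);
        sum += p[i];
      }
      for (double& v : p) v /= sum;
      return p;
    }

    // Compare two same-length logit vectors from different implementations.
    void Compare(const std::vector<double>& a_logits,
                 const std::vector<double>& b_logits) {
      const std::vector<double> a = Softmax(a_logits), b = Softmax(b_logits);
      double max_diff = 0.0, kl = 0.0;
      for (std::size_t i = 0; i < a.size(); ++i) {
        max_diff = std::max(max_diff, std::fabs(a[i] - b[i]));
        if (a[i] > 0.0 && b[i] > 0.0) kl += a[i] * std::log(a[i] / b[i]);
      }
      std::printf("max |p_a - p_b| = %g   KL(a||b) = %g\n", max_diff, kl);
    }

    int main() {
      // Tiny fake example; in practice these would be vocab-sized dumps.
      Compare({1.0, 2.0, 3.0}, {1.1, 2.0, 2.9});
      return 0;
    }

For determinism, greedy decoding (temperature 0) on both sides at least keeps sampling noise out of the comparison.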