Hi folks, I work for Groq. Feel free to ask me any questions.
(If you check my HN post history you'll see I post a lot about Haskell. That's right, part of Groq's compilation pipeline is written in Haskell!)
The main problem with the Groq LPUs is, they don't have any HBM on them at all. Just a minuscule (230 MiB) [0] amount of ultra-fast SRAM (20x faster than HBM3, just to be clear). Which means you need ~256 LPUs (4 full server racks of compute; each unit on a rack contains 8x LPUs and there are 8x of those units on a single rack) just to serve a single model [1], whereas you can get a single H200 (1/256 of that server rack density) and serve these models reasonably well.
It might work well if you have a single model with lots of customers, but as soon as you need more than a single model and a lot of finetunes/high rank LoRAs etc., these won't be usable. The same goes for any on-prem deployment, since the main advantage is consolidating many people onto the same model, together.
[0]: https://wow.groq.com/groqcard-accelerator/
[1]: https://twitter.com/tomjaguarpaw/status/1759615563586744334
Groq Engineer here, I'm not seeing why being able to scale compute outside of a single card/node is somehow a problem. My preferred analogy is to a car factory: Yes, you could build a car with say only one or two drills, but a modern automated factory has hundreds of drills! With a single drill, you could probably build all sorts of cars, but a factory assembly line is only able to make specific cars in that configuration. Does that mean that factories are inefficient?
You also say that H200s work reasonably well, and that's reasonable (but debatable) for synchronous, human-interaction use cases. Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.
Just curious, how does this work out in terms of TCO (even assuming the price of a Groq LPU is $0)? What you say makes sense, but I'm wondering how you strike a balance between massive horizontal scaling vs vertical scaling. Sometimes (quite often, in my experience) having a few beefy servers is much simpler/cheaper/faster than scaling horizontally across many small nodes.
Or have I got this completely wrong, and your solution enables use cases that are simply unattainable on mainstream (Nvidia/AMD) hardware, making the TCO argument less relevant?
We're providing by far the lowest latency LLM engine on the planet. You can't reduce latency by scaling horizontally.
Distributed, shared memory machines used to do exactly that in HPC space. They were a NUMA alternative. It works if the processing plus high-speed interconnect are collectively faster than the request rate. The 8x setups with NVLink are kind of like that model.
You may have meant that nobody has a stack that uses clustering or DSM with low-latency interconnects. If so, then that might be worth developing given prior results in other low-latency domains.
I think existing players will have trouble developing a low latency solution like ours while they are still running on non-deterministic hardware.
What do you mean by non-deterministic hardware? cuBLAS on a laptop GPU was deterministic when I tried it last iirc
Tip of the iceberg.
DRAM needs to be refreshed every X cycles.
This means you don't know how long a read from memory will take: your read could land during a refresh cycle. The refresh circuitry also adds latency.
OP says SRAM, which doesn't decay so no refreshing.
Timing issues can also come down to the FETs that make up the logic circuits of a chip: the transitions from high to low and low to high have a minimum safe time needed to register properly...
Non-deterministic timing characteristics.
While you're here, I have a quick, off-topic question. We've seen incredible results with GPT-3 175B (Davinci) and GPT-4 (MoE). Making attempts at open models that reuse their architectural strategies could have a high impact on everyone. Those models took 2,500-25,000 GPUs to train, though. It would be great to have a low-cost option for pre-training Davinci-class models.
It would be great if a company or others with AI hardware were willing to do production runs of chips sold at cost specifically to make open, permissively licensed models. As in, since you'd lose profit, the cluster owner and users would be legally required to only make permissive models. Maybe at least one in each category (e.g. text, visual).
Do you think your company or any other hardware supplier would do that? Or would someone sell 2,500 GPUs at cost for open models?
(Note to anyone involved in CHIPS Act: please fund a cluster or accelerator specifically for this.)
Great idea, but Groq doesn't have a product suitable for training at the moment. Our LPUs shine in inference.
Distributed, shared memory machines used to do exactly that in HPC space.
reformed HPC person here.
Yes, but not latency optimised in the case here. HPC is normally designed for throughput. Accessing memory from outside your $locality is normally horrifically expensive, so only done when you can't avoid it.
For most serving cases, you'd be much happier having a bunch of servers with a number of groqs in them, than managing a massive HPC cluster and trying to keep it both up and secure. The connection access model is much more traditional.
Shared memory clusters are not really compatible with secure end-user access. It is possible to partition memory access, but it's not something that's off the shelf (well, that might have changed recently). Also, shared memory means shared fuckups.
I do get what you're hinting at, but if you want to serve low latency, high compute "messages", then discrete "APU" cards are a really good way to do it simply (assuming you can afford it). HPCs are fun, but it's not fun trying to keep them up with public traffic on them.
30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.
I believe that this is doable - my pipeline is generally closer to 400ms without RAG and with Mixtral, with a lot of non-ML hacks to get there. It would also definitely be doable with a joint speech-language model that removes the transcription step.
For these use cases, time to first byte is the most important metric, not total throughput.
It’s important…if you’re building a chatbot.
The most interesting applications of LLMs are not chatbots.
The most interesting applications of LLMs are not chatbots.
What are they then? Every use case I've seen is either a chatbot or something like a copy editor, which is just a long-form chatbot.
Complex data tagging/enrichment tasks.
Obviously not op, but these days LLMs can be fuzzy functions with reliably structured output, and are multi-modal.
Think about the implications of that. I bet you can come up with some pretty cool use cases that don't involve you talking to something over chat.
One example:
I think we'll be seeing a lot of "general detectors" soon. Without training or predefined categories, get pinged when (whatever you specify) happens. Whether it's a security camera, web search, event data, etc
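As a concrete illustration of the "fuzzy function with reliably structured output" idea, here is a minimal sketch that tags free-text records into a fixed JSON schema and could back a crude "general detector". The endpoint URL, model name, and key are placeholders for whatever OpenAI-compatible chat API you use; none of this is a specific provider's documented interface.

    import json
    import requests

    API_URL = "https://example.com/v1/chat/completions"  # placeholder endpoint
    API_KEY = "YOUR_KEY"                                  # placeholder credential

    def tag_record(text: str) -> dict:
        """Ask the model to enrich a free-text record into a fixed JSON schema."""
        prompt = (
            "Return ONLY a JSON object with keys: category (string), "
            "urgency (low|medium|high), entities (list of strings).\n\n"
            f"Text: {text}"
        )
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": "mixtral-8x7b",  # placeholder model id
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0,
            },
            timeout=30,
        )
        content = resp.json()["choices"][0]["message"]["content"]
        return json.loads(content)  # fails loudly if the model drifts off-schema

    # Pipe log lines / tickets / events through tag_record() and alert when,
    # e.g., urgency == "high" -- that is the "general detector" pattern above.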
The most interesting applications of LLMs are not chatbots.
In your opinion, what are the most interesting?
Are there voice responses in the demo? I couldn't find em?
Here's a live demo on CNN of Groq plugged into a voice API
Thanks, that's pretty impressive. I suppose with blazing fast token generation now things like diarisation and the actual model are holding us back.
Once it flawlessly understands when it is being spoken to/if it should speak based on the topic at hand (like we do) then it'll be amazing.
I wonder if ML models can feel that feeling of wanting to say something so bad but having to wait for someone else to stop talking first ha ha.
Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia
I built one, should be live soon ;-)
Exciting! Looking forward to seeing it.
I have one, with 13B, on a 5-year-old 48GB Q8000 GPU. It can also see; it's LLaVA. And it being local is very important, as privacy matters and streaming images to the cloud is time consuming.
You only need a few tokens, not the full 500-token response, to run TTS. And you can pre-generate responses online while ASR is still in progress. With a bit of clever engineering, the response starts with virtually no delay, at the moment it's natural to start responding.
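A minimal sketch of that overlap, under assumptions: stream_tokens and speak below are placeholders for your streaming LLM client and TTS engine. The point is only the structure: flush to TTS at clause boundaries instead of waiting for the full response.

    import re

    def stream_tokens(prompt):
        # Placeholder: yield tokens from your LLM API as they arrive.
        yield from "Sure - the capital of France is Paris. Anything else?".split()

    def speak(text):
        # Placeholder: hand `text` to your TTS engine.
        print(f"[TTS] {text}")

    def respond(prompt, boundary=re.compile(r"[.!?,]$")):
        buffer = []
        for tok in stream_tokens(prompt):
            buffer.append(tok)
            # Flush at clause boundaries so speech can begin after a handful
            # of tokens instead of after the full 500-token response.
            if boundary.search(tok):
                speak(" ".join(buffer))
                buffer = []
        if buffer:
            speak(" ".join(buffer))

    respond("What's the capital of France?")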
Did you find anything cheaper for local installation?
You can’t scale horizontally forever because of communication. I think HBM would provide a lot more flexibility with the number of chips you need.
Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.
Is your version of that on a different page from this chat bot?
If you want low latency you have to be really careful with HBM, not only because of the delay involved, but also the non-determinacy. One of the huge benefits of our LPU architecture is that we can build systems of hundreds of chips with fast interconnect and we know the precise timing of the whole system to within a few parts per million. Once you start integrating non-deterministic components your latency guarantees disappear very quickly.
From a theoretical perspective, this is absolutely not true. Asynchronous logic can achieve much lower latency guarantees than synchronous logic.
Come to think of it, this is one of the few places where asynchronous logic might be more than academic... Async logic is hard with complex control flows, which deep learning inference does not have.
(From a practical perspective, I know you were comparing to independently-clocked logic, rather than async logic)
(Groq Employee) You're right - we are comparing to independently-clocked logic.
I wonder whether async logic would be feasible for reconfigurable "Spatial Processor" type architectures [1]. As far as LPU architectures go, they fall in the "Matrix of Processing Engines"[1] family of architectures, which I would naively guess is not the best suited to leverage async logic.
1: I'm using the "Spatial Processor" (7:14) and "Matrix of Processing Engines" (8:57) terms as defined in https://www.youtube.com/watch?v=LUPWZ-LC0XE. Sorry for a video link, I just can't think of another single reference that explains the two approaches.
Curiously, almost all of this video is mostly covered by computer architectures lit in the late 90's early 00's. At the time, I recall Tom Knight had done most of the analysis in this video, but I don't know if he ever published it. It was extrapolating into the distant future.
To answer your questions:
- Spatial processors are an insanely good fit for async logic
- Matrix of processing engines are a moderately good fit -- definitely could be done, but I have no clue if it'd be a good idea.
In SP, especially in an ASIC, each computation can start as soon as the previous one finishes. If you have a 4-bit layer, an 8-bit layer, and a 32-bit layer, those will take different amounts of time to run. Individual computations can take different amounts of time too (e.g. an ADD with a lot of carries versus one with just a few). In an SP, a computation takes as much time as it needs, and no more.
Footnote: Personally, I think there are a lot of good ideas in 80's era and earlier processors for the design of individual compute units which have been forgotten. The basic move in architectures up through 2005 was optimizing serial computation speed at the cost of power and die size (Netburst went up to 3.8GHz two decades ago). With much simpler old-school compute units, we can have *many* more of them than a modern multiply unit. Critically, they could be positioned closer to the data, so there would be less data moving around. Especially the early pipelined / scalar / RISC cores seem very relevant. As a point of reference, a 4090 has 16k CUDA cores running at just north of 2GHz. It has the same number of transistors as 32,000 SA-110 processors (running at 200MHz on a 350 nanometer process in 1994).
TL;DR: I'm getting old and either nostalgic or grumpy. Dunno which.
This was sort of the dream of KNL but today I noticed
Xeon Phi CPUs support (a.k.a. Knight Landing and Knight Mill) are marked as deprecated. GCC will emit a warning when using the -mavx5124fmaps, -mavx5124vnniw, -mavx512er, -mavx512pf, -mprefetchwt1, -march=knl, -march=knm, -mtune=knl or -mtune=knm compiler switches. Support will be removed in GCC 15.
the issue was that coordinating across this kind of hierarchy wasted a bunch of time. If you already knew how to coordinate, mostly, you could instead get better performance. You might be surprised, but we're getting to the point that communicating across a supercomputer is on the same order of magnitude as talking across a NUMA node.
I actually wasn't so much talking from that perspective, as simply from the perspective of the design of individual pieces. There were rather clever things done in e.g. older multipliers or adders or similar which, I think, could apply to most modern parallel architectures, be that GPGPU, SP, MPE, FPGA, or whatever, in order to significantly increase density at a cost of slightly reduced serial performance.
For machine learning, that's a good tradeoff.
Indeed, with some of the simpler architectures, I think computation could be moved into the memory itself, as long dreamed of.
(Simply sticking 32,000 SA-110 processors on a die would be very, very limited by interconnect; there's a good reason for the types of architectures we're seeing not being that)
Truth is, there is another startup called Graphcore that is doing exactly that, and also making a really big chip.
They do what you were talking about, not what I was.
They seem annoying. "The IPU has a unique memory architecture consisting of large amounts of In-Processor-Memory™ within the IPU made up of SRAM (organised as a set of smaller independent distributed memory units) and a set of attached DRAM chips which can transfer to the In-Processor-Memory via explicit copies within the software. The memory contained in the external DRAM chips is referred to as Streaming Memory™."
There's a ™ every few words. Those seem like pretty generic terms. That's their technical documentation.
The architecture is reminiscent of some ideas from circa-2000 which didn't pan out. It reminds me of Tilera (the guy who ran it was the Donald Trump of computer architectures; company was acquihired by EZchip for a fraction of the investment which was put into it, which went to Mellanox, and then to NVidia).
Sweet, thanks! It seems like this research ecosystem was incredibly rich, but Moore's Law was in full swing, and statically known workloads weren't useful at the compute scale back then.
So these specialized approaches never stood a chance next to CPUs. Nowadays the ground is... more fertile.
Lots of things were useful to compute.
The problem was
1) If you took 3 years longer to build a SIMD architecture than Intel to make a CPU, Intel would be 4x faster by the time you shipped.
2) If, as a customer, I was to code to your architecture, and it took me 3 more years to do that, by that point, Intel would be 16x faster
And any edge would be lost. The world was really fast-paced. Groq was founded in 2016. It's 2024. If it were still the heyday of Moore's Law, you'd be competing with CPUs running 40x as fast as today's.
I'm not sure you'd be so competitive against a 160GHz processor, and I'm not sure I'd be interested knowing a 300+GHz part was just around the corner.
Good ideas -- lots of them -- lived in academia, where people could prototype neat architectures on ancient processes, and benchmark themselves to CPUs of yesteryear from those processes.
I don't know about HBM specifically, but DDR and GDDR at a protocol level are both deterministic. It's the memory controller doing a bunch of reordering that makes them non-deterministic. Presumably, if that is the reason you don't like DRAM, you could build your compiler to be memory-layout aware and have the memory controller issue commands without reordering.
Presumably with dram you also have to worry about refreshes, which can come along at arbitrary times relative to the workload.
You can control when those happen, too.
not without affecting performance though? If you delay refreshes, this lowers performance as far as I remember...
Control of all of this can come at a performance cost, but in the case of DRAM refreshes, it doesn't lower performance if you don't do them, it loses data. Nominally, you could do your refreshes closer together and as long as you know that the rows being refreshed will be idle and you have spare time on the bus, you're ok.
That could be possible. It's out of my area of expertise so I can't say for sure. My understanding was HBM forces on you specific access patterns and non-deterministic delays. Our compiler already deals with many other forms of resource-aware scheduling so it could take into account DRAM refreshes easily, so I feel like there must be something else that makes SRAM more suitable in our case. I'll have to leave that to someone more knowledgeable to explain though ...
Surely once you're scaling over multiple chips/servers/racks you're dealing with retries and checksums and sequence numbers anyway? How do you get around the non-determinacy of networking beyond just hoping that you don't see any errors?
Our interconnect between chips is also deterministic! You can read more about our interconnect, synchronisation, and error correction in our paper.
https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...
Groq states in this article [0] that they used 576 chips to achieve these results. Continuing with your analysis, you also need to factor in that each additional user you want to serve requires a separate KV cache, which can add multiple more gigabytes per user.
My professional, independent-observer opinion (not based on my 2 years of working at Groq) would have me assume that their COGS to achieve these performance numbers would exceed several million dollars. Depreciating that over expected usage at the theoretical prices they have posted seems impractical, so from an actual performance-per-dollar standpoint they don't seem viable, but they do have a very cool demo of an insane level of performance if you throw cost concerns out the window.
[0]: https://www.nextplatform.com/2023/11/27/groq-says-it-can-dep...
Thomas, I think for full disclosure you should also state that you left Groq to start a competitor (a competitor which doesn't have the world's lowest latency LLM engine, nor a guarantee to match the cheapest per-token prices, like Groq does).
Anyone with a serious interest in the total cost of ownership of Groq's system is welcome to email contact@groq.com.
I thought that was clear through my profile, but yes, Positron AI is focused on providing the best performance per dollar while providing the best quality of service and capabilities rather than just focusing on a single metric of speed.
A guarantee to match the cheapest per-token prices is sure a great way to lose a race to the bottom, but I do wish Groq (and everyone else trying to compete against NVIDIA) the greatest luck and success. I really do think the great single-batch/single-user performance by Groq makes for a great demo, but it is not the best solution for a wide variety of applications. I hope it can find its niche.
What happened to Rex? Did it hit production or get abandoned?
It was also on my list of things to consider modifying for an AI accelerator. :)
Long story, but technically REX is still around but has not been able to continue to develop due to lack of funding and my cofounder and I needing to pay bills. We produced initial test silicon, but due to us having very little money after silicon bringup, most of our conversations turned to acquihire discussions.
There should be a podcast release (https://microarch.club/) in the near future that covers REX's history and a lot of lessons learned.
I think that just means it’s for people that really want it?
John Doe and his friends will never need their fart jokes generated at this speed, and are more interested in low costs.
But we'd recently been doing call center operations, and being able to quickly figure out what someone said was a major issue. You kind of don't want your system to wait for a second before responding each time. I can imagine it making sense if it reduces the latency to 10ms there as well. Though you might still run up against the 'good enough' factor.
I guess few people want to spend millions to go from 1000ms to 10ms, but when they do they really want it.
The main problem with the Groq LPUs is, they don't have any HBM on them at all. Just a minuscule (230 MiB) [0] amount of ultra-fast SRAM [...]
IDGAF about any of that, lol. I just want an API endpoint.
480 tokens/sec at $0.27 per million tokens? Sign me up, I don't care about their hardware at all.
There are providers out there offering $0 per million tokens; that doesn't mean it is sustainable and won't disappear as soon as the VC well runs dry. I'm not saying this is the case for Groq, but in general you probably should care if you want to build something serious on top of anything.
(Groq Employee) Agreed, one should care, and especially since this particular service is very differentiated by its speed and has no competitors.
That being said, until there's another option at anywhere that speed.. That point is moot, isn't it :)
For now, Groq is the only option that lets you build a UX with near-instant response times. Or live agents that help with a human-to-human interaction. I could go on and on about the product categories this opens up.
Why go so fast? Aren't Nvidias products fast enough from a TPS perspective?
OpenAI have a voice powered chat mode in their app and there's a noticeable delay of a few seconds between finishing your sentence and the bot starting to speak.
I think the problem is that for realistic TTS you need quite a few tokens because the prosody can be affected by tokens that come a fair bit further down the sentence, consider the difference in pitch between:
"The war will be long and bloody"
vs
"The war will be long and bloody?"
So to begin TTS you need quite a lot of tokens, which in turn means you have to digest the prompt and run a whole bunch of forward passes before you can start rendering. And of course you have to keep up with the speed of regular speech, which OpenAI sometimes struggles with.
That said, the gap isn't huge. Many apps won't need it. Some use cases where low latency might matter:
- Phone support.
- Trading. Think digesting a press release into an action a few seconds faster than your competitors.
- Agents that listen in to conversations and "butt in" when they have something useful to say.
- RPGs where you can talk to NPCs in realtime.
- Real-time analysis of whatever's on screen on your computing device.
- Auto-completion.
- Using AI as a general command prompt. Think AI bash.
Undoubtedly there will be a lot more, though. When you give people performance, they find ways to use it.
You've got good ideas. What I like to personally say is that Groq makes the "Copilot" metaphor real. A copilot is supposed to be fast enough to keep up with reality and react live :)
Groq devices are really well set up for small-batch-size inference because of the use of SRAM.
I'm not so convinced they have a Tok/sec/$ advantage at all, though, and especially at medium to large batch sizes which would be the groups who can afford to buy so much silicon.
I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1, and Nvidia cards do get meaningfully higher throughput as batch size gets into the 100's.
I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1
I guess if you don't have any extra junk you can pack more processing into the chip?
(Groq Employee) Yes! Determinism + Simplicity are superpowers for ALU and interconnect utilization rates. This system is powered by 14nm chips, and even the interconnects aren't best in class.
We're just that much better at squeezing tokens out of transistors and optical cables than GPUs are - and you can imagine the implications for Watts/Token.
Anyways.. wait until you see our 4nm. :)
I've been thinking the same, but on the other hand, that would mean they are operating at a huge loss, which doesn't scale.
(Groq Employee) It's hard to discuss Tok/sec/$ outside of the context of a hardware sales engagement.
This is because the relationship between Tok/s/u, Tok/s/system, Batching, and Pipelining is a complex one that involves compute utilization, network utilization, and (in particular) a host of compilation techniques that we wouldn't want to share publicly. Maybe we'll get to that level of transparency at some point, though!
As far as Batching goes, you should consider that with synchronous systems, if all the stars align, Batch=1 is all you need. Of course, the devil is in the details, and sometimes small batch numbers still give you benefits. But Batch 100's generally gives no advantages. In fact, the entire point of developing deterministic hardware and synchronous systems is to avoid batching in the first place.
more than a single model and a lot of finetunes/high rank LoRAs
I can imagine a way might be found to host a base model and a bunch of LoRAs while using barely more RAM than the base model alone.
The fine-tuning could perhaps be done in such a way that only perhaps 0.1% of the weights are changed, and for every computation the difference is applied not to the weights, but to the output-layer activations.
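A rough sketch of that idea (illustrative shapes and rank, not any vendor's implementation): keep one shared copy of the base weights and apply each user's low-rank delta on the activations, y = xW + (xA)B, so N adapters cost only the small A/B factors each.

    import numpy as np

    d_in, d_out, rank = 4096, 4096, 8
    W = np.random.randn(d_in, d_out).astype(np.float16)   # shared base weights

    # One tiny (A, B) pair per fine-tune; ~65k params each vs ~16.8M for W.
    adapters = {
        "user_a": (np.random.randn(d_in, rank).astype(np.float16),
                   np.random.randn(rank, d_out).astype(np.float16)),
        "user_b": (np.random.randn(d_in, rank).astype(np.float16),
                   np.random.randn(rank, d_out).astype(np.float16)),
    }

    def forward(x: np.ndarray, user: str) -> np.ndarray:
        A, B = adapters[user]
        base = x @ W              # shared computation against base weights
        delta = (x @ A) @ B       # cheap low-rank correction, per adapter
        return base + delta

    x = np.random.randn(1, d_in).astype(np.float16)
    print(forward(x, "user_a").shape)   # (1, 4096)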
There's also papers for hosting full-parameter fine-tuned models: https://arxiv.org/abs/2312.05215
Disclaimer: I'm one of the authors.
This actually already exists! We did a writeup of the relevant optimizations here: https://openpipe.ai/blog/s-lora
I recall a recent discussion about a technique to load the diff in weights between a LoRA and the base model, zip it, and transfer it on an as-needed basis.
I honestly don't see the problem.
"just to serve a single model" could be easily fixed by adding a single LPDDR4 channel per LPU. Then you can reload the model sixty times per second and serve 60 different models per second.
Per-chip compute is not the main thing this chip innovates on for fast inference; it's the extremely fast memory bandwidth. When you do that, you lose all of that and will be much worse off than any off-the-shelf accelerator.
load model, compute a 1k token response (ie, do a thousand forward passes in sequence, one per token), load a different model, compute a response,
I would expect the model loading to take basically zero percent of the time in the above workflow
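Rough, back-of-envelope arithmetic for that claim, using assumed numbers (a single LPDDR4-class channel at ~15 GB/s per card, 230 MiB of SRAM to refill, ~480 tok/s generation), none of which are measured Groq figures:

    # Fraction of time spent reloading weights in the
    # "load model -> generate 1k tokens -> load next model" loop.
    sram_bytes = 230 * 2**20      # 230 MiB of on-chip SRAM per card (assumed)
    lpddr_bw = 15e9               # ~15 GB/s, single LPDDR4-class channel (assumed)
    tokens = 1000
    tok_per_s = 480               # demo-class generation speed (assumed)

    reload_s = sram_bytes / lpddr_bw      # ~16 ms per model swap
    generate_s = tokens / tok_per_s       # ~2.1 s per 1k-token response
    print(f"reload {reload_s*1e3:.0f} ms, generate {generate_s:.2f} s, "
          f"overhead {100 * reload_s / (reload_s + generate_s):.1f}%")  # ~0.8%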
I just want to say that this is one of the most impressive tech demos I've ever seen in my life, and I love that it's truly an open demo that anyone can try without even signing up for an account or anything like that. It's surreal to see the thing spitting out tokens at such a crazy rate when you're used to watching them generate at less than one fifth that speed. I'm surprised you guys haven't been swallowed up by Microsoft, Apple, or Google already for a huge premium.
I have it on good authority that Apple was very close to acquiring Groq
If this is true, expect a call from the SEC...
Even if it isn't true.
Disclosing inside information is illegal, _even if it is false and fabricated_, if it leads to personal gains.
You have to prove the OP had personal gains. If he's just a troll, it will be difficult.
You also have to be an insider.
If I go to a bar and overhear a pair of Googlers discussing something secret, I can:
1) Trade on it.
2) Talk about it.
Because I'm not an insider. On the other hand, if I'm sleeping with the CEO, I become an insider.
Not a lawyer. Above is not legal advice. Just a comment that the line is much more complex, and talking about a potential acquisition is usually okay (if you're not under NDA).
It doesn't matter if you overheard it at a bar or if you're just some HN commenter posting completely incorrect legal advice; the law prohibits trading on material nonpublic information.
I would pay a lot to see you try your ridiculous legal hokey-pokey on how to define an "insider."
Had insider trading training, and yes, that's the gist of it. If you know or presume that the information is material (makes a difference) and not public, it's illegal to act on it.
Roughly, it's illegal only if you have some duty not to trade on it. If you acquired the information without misappropriating it (like overhearing it from strangers in a normal public bar), then you're free to trade.
https://corpgov.law.harvard.edu/2017/01/18/insider-trading-l...
There's no reason for normal corporate training to discuss that element, because an employee who trades their employer's stock based on MNPI has near-certainly misappropriated it. The question of whether a non-employee has misappropriated information is much more complex, though.
Training is designed to protect the corporation, not to provide accurate legal advice. That's true of most corporate trainings, for that matter, be that bribes/corruption, harassment, discrimination, or whatnot. Corporations want employees very far from the line.
That's the right way to run them.
If you want more nuance, talk to a lawyer or read case law.
Generally, insider trading requires something along the lines of a fiduciary duty to keep the information secret, albeit a very weak one. I'm not going to slice that line, but you see references in-thread.
If you did hear it in a bar, could you tweet it out before your trade, so the information is made public?
If you hear it in a bar it’s already public.
I really doubt that you can make yourself something public so that you can later act on it.
the law prohibits trading on material nonpublic information.
Isn’t it public information the moment it’s said audibly in a public space?
No. It's not. However, as pointed out elsewhere, you can trade on many types of non-public information. Indeed, hedge funds engage in all sorts of surveillance in order to get non-public material information to trade on which gives them a proprietary edge.
You just can't trade on insider information.
That's a very complex legal line.
Unless you earn enough money to retain good lawyers and are prepared to get into complicated legal troubles, getting sued isn't a great outcome even if you win.
The prudent thing to do is to stay away from anything that might make you become a target of investigation, unless the gains outweigh the risk by a significant margin.
Feel free to share some legal precedence where this situation has fared poorly for someone who "overheard it at a bar".
It'd also be a good time to watch you lose all that money on your hokey-pokey assumption.
No, a bar is a public place so this counts as a public disclosure. The people having the conversation would be in trouble with the SEC for making a disclosure in this manner.
Just so you know, no one's ever been taken to court for discussing the law; it doesn't matter that you're not a lawyer. It's basically a meme.
Are you a lawyer?
Just so you know, plenty of people have been penalized for practicing law without a license. If someone engages in insider trading based on a mistake you made on the internet, you can be liable.
In my jurisdiction, that would involve me taking money (not just talking on the internet), so I'm not at risk, but in plenty of states, you can be. A lot of this hinges on the difference between "legal information" (which is generic) and "legal advice" (which is specific).
There are whole law review articles on this, which I read more than a decade ago, nerding on something related.
But that's beside the point. A major reason for the disclaimer is that people SHOULD be aware of my level of expertise. I do the same on technical posts too. I'll disclaim whether e.g. I have world-class expertise in a topic, worked in an adjacent domain, or read a blog post somewhere (and wish others did too). It's helpful to know people's backgrounds. I am NOT a lawyer specializing in securities law. I know enough to tell people the line is more complex than trading on non-public information, but I am utterly unqualified to tell people where that line is. If you're planning to do that, you SHOULD NOT rely on it. Either read relevant case law, talk to a genuine lawyer who specializes in this stuff, or find some other way to educate yourself on whether what you're doing is okay.
So it does matter I'm not a lawyer, if not for the reasons you mentioned.
TIL that the SEC has authority over private company dealings wrt the sale of shares [1].
[1] https://www.sec.gov/education/capitalraising/building-blocks...
Not if poster is in a crashing plane...
Really glad you like it! We've been working hard on it.
The speed part or the being swallowed part?
The speed part. We're not interested in being swallowed. The aim is to be bigger than Nvidia in three years :)
Can you warn us pre-IPO?
I'm sure you'll hear all about our IPO on HN :) :)
Is Sam going to give you some of his $7T to help with that?
Go for it!
Yes please
Is this useful for training as well as running a model? Or is this approach specifically for running an already-trained model faster?
Currently graphics processors work well for training. Language processors (LPUs) excel at inference.
Did you custom build those Language processors for this task? Or did you repurpose something already existing? I have never heard anyone use ‘Language processor’ before.
The chips are built for general purpose low latency, high throughput numerical compute.
In principle, training is basically the same as running inference, but iteratively; in practice, training would use a different software stack.
Training requires a lot more memory to keep gradients + gradient stats for the optimizer, and needs higher precision weights for the optimization. It's also much more parallelizable. But inference is kind of a subroutine of training.
Perplexity Labs also has an open demo of Mixtral 8x7b although it's nowhere near as fast as this.
Poe has a bunch of them, including Groq as well!
Sure, but the responses are very poor compared to MS tools.
Why sell? It would be much more delightful to beat them at their own game.
ok... why tho? genuinely ignorant and extremely curious.
what's the TFLOPS/$ and TFLOPS/W and how does it compare with Nvidia, AMD, TPU?
From quick Googling I feel like Groq has been making these sorts of claims since 2020, and yet people pay a huge premium for Nvidia and Groq doesn't seem to be giving them much of a run for their money.
Of course, if you run a much smaller model than ChatGPT on similar or more powerful hardware, it might run much faster, but that doesn't mean it's a breakthrough for most models or for use cases where latency isn't the critical metric?
Very impressive looking! Just wanted to caution it's worth being a bit skeptical without benchmarks as there are a number of ways to cut corners. One prominent example is heavy model quantization, which speeds up the model at a cost of model quality. Otherwise I'd love to see LLM tok/s progress exactly like CPU instructions/s did a few decades ago.
At least for the earlier Llama 70B demo, they claimed to be running unquantized. https://twitter.com/lifebypixels/status/1757619926360096852
Update: This comment says "some data is stored as FP8 at rest" and I don't know what that means. https://news.ycombinator.com/item?id=39432025
The weights are quantized to FP8 when they're stored in memory, but all the activations are computed at full FP16 precision.
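A toy illustration of weight-only quantization of that flavor, simulated here with per-row int8 plus a scale rather than a real FP8 format, and certainly not Groq's actual scheme: the weights live in an 8-bit form at rest and are expanded to FP16 just before the matmul, while the activations stay FP16 throughout.

    import numpy as np

    def quantize_rows(W):
        scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
        q = np.round(W / scale).astype(np.int8)       # 1 byte per weight at rest
        return q, scale.astype(np.float16)

    def matmul_fp16(x, q, scale):
        W16 = q.astype(np.float16) * scale            # dequantize to FP16
        return x.astype(np.float16) @ W16             # activations stay FP16

    W = np.random.randn(512, 512).astype(np.float32)
    x = np.random.randn(1, 512)
    q, s = quantize_rows(W)

    exact = x.astype(np.float16) @ W.astype(np.float16)
    approx = matmul_fp16(x, q, s)
    print("max abs error:", float(np.abs(exact - approx).max()))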
Can you explain if this affects quality relative to fp16? And is mixtral quantized?
We don't think so, but you be the judge! I believe we quantize both Mixtral and Llama 2 in this way.
Is your confidence rooted in quantified testing, or just vibes? I'm sure you're right, just curious. (My reasoning: running inference at full fp16 is borderline wasteful. You can use q7 with almost no loss.)
I know some fancy benchmark says "almost no loss", but... subjectively, there is a clear quality loss. You can try for yourself: I can run Mixtral at 5.8bpw and there is an OBVIOUS difference between what I have seen from Groq and my local setup, besides the sound-barrier-shattering speed of Groq. I didn't know Mixtral could output such nice code, and I have used it A LOT locally.
Yes, but this gray area underperformance that lets them claim they are the cheapest and fastest appeals to people for whom qualitative (aka real) performance doesn’t matter.
What quantified testing would you like to see? We've had a lot of very good feedback from our users, particularly about Mixtral.
Nothing really wrong with FP8 IMO, it performs pretty damn well usually within 98% while significantly reducing memory usage.
As tome mentioned we don’t quantize, all activations are FP16
And here are some independent benchmarks https://artificialanalysis.ai/models/llama-2-chat-70b
Jesus Christ, these speeds with FP16? That is simply insane.
Ask how much hardware is behind it.
All that matters is the cost. Their price is cheap, so the real question is whether they are subsidizing the cost to achieve that price or not.
All that matters is the cost.
Not really, sustainability matters, if they are the only game in town, you want to know that game isn't going to end suddenly when their runway turns into a brick wall.
Cost, not price.
The point of asking how much hardware is to estimate the cost? (Both capital and operational, i.e. power)
As a fellow scientist I concur with the approach of skepticism by default. Our chat app and API are available for everyone to experiment with and compare output quality with any other provider.
I hope you are enjoying your time of having an empty calendar :)
Wait you have an API now??? Is it open, is there a waitlist? I’m on a plane but going to try to find that on the site. Absolutely loved your demo, been showing it around for a few months.
There is an API and there is a waitlist. Sign up at http://wow.groq.com/
Maybe I'm stretching the analogy too far, but are we in the transistor regime of LLMs already? Sometimes I see these 70 billion parameter monstrosities and think we're still building ENIAC out of vacuum tubes.
In other words, are we ready to steadily march on, improving LLM tok/s year by year, or are we a major breakthrough or two away before that can even happen?
The thing is that tokens aren't an apples to apples metric.... Stupid tokens are a lot faster than clever tokens. I'd rather see token cleverness improving exponentially....
tangent: Great to see you again on HN!
As part of our benchmarking of Groq we have asked Groq regarding quantization and they have assured us they are running models at full FP-16. It's a good point and important to check.
Link to benchmarking: https://artificialanalysis.ai/ (Note question was regarding API rather than their chat demo)
Does this make it practical to run LLMs on mobile devices? I wonder about the power consumption and whether it could make sense to have it integrated in some future mobile devices. Or maybe have dedicated storage, RAM, and processing cores as a USB-C add-on? A case with an integrated battery and this chip?
I'm dreaming of having LLMs on anything. Unlike the "Bluetooth on everything" craze, this could be practical, as every device can become smart. Remember how some British researchers made a self driving car using an LLM? A toaster anticipating how to cook when you describe what you want would actually be an improvement.
I assume this is a million-dollar rack of custom chips so it's probably not coming to mobile any time soon.
Well, currently it's entirely possible to run these models on iPhones. It's just not practical because it eats all the resources and the battery while slowly generating the output.
Therefore, if Groq has achieved significant efficiency improvements, that is, they are not getting that crazy speed through enormous power consumption, then maybe they can eventually build low-power, mass-produced, cutting-edge-fabbed chips that run at acceptable speed?
The thing is, I don't see any efficiency improvements. I see models running fast on very expensive hardware using techniques that don't scale down.
Care to explain? Are they using 10x energy for 10x speed improvements?
They're using hundreds of chips. Based on the data sheet I would estimate this demo uses 173 kW. It may be 100x energy to get 10x speedup.
100s of chips for who knows how many clients. The mobile phone will have to do calculations just for 1 client.
Yes, we pipeline requests so multiple users are being handled by the same hardware at one time.
Thanks for the clarification. So, would you say that Groq has a potential to have let's say OpenAI speeds on handheld devices at reasonable energy consumption? Or is that not really what this tech's strength is maybe?
The industry as a whole is a very long way away from that. The power requirements are too high for mobile.
I think the limitation is chip size/cost. SRAM is a lot less dense than DRAM. According to Google, it's typically used for registers and caches, which are only megabytes large.
Remember how some British researchers made a self driving car using an LLM?
No? Do you mean actual, full self driving on normal roads in traffic?
Yes, IIRC they reason about the car's actions using LLMs. They still use image processing, but once you identify the objects in the scene, the LLM interprets and decides what to do with the car.
I'm not sure which one it was, though (Ghost Autonomy, maybe?).
Do you have a source? Because that actually, properly working would be headline global news and would value the company in the billions.
It was discussed here on HN, that's how I know about it.
I found a few things when I searched around, but I'm not sure which one was the one I recall.
Anyway, here is a video from one: https://www.youtube.com/watch?v=C2rbym6bXM0
Here is a paper discussing something similar: https://arxiv.org/abs/2307.07162
The description for that video says
Ghost Autonomy’s MLLM-based capabilities are currently in development. These video and image examples show MLLM-based analysis of driving scenes captured from Ghost vehicles driving in both autonomous and conventional mode. MLLM-based reasoning is not yet being returned to the car to impact actual driving maneuvers.
So the model discussed is not doing any driving whatsoever. This is not self-driving at any level.
Then it's not the one I remember, maybe.
Yeah just offload the compute onto the cloud.
It's too unreliable, too restricted, and too not-private.
ChatGPT stopped processing images for me. I'm trying to get help, but support doesn't appear to be very fast; they asked for more info, but I haven't heard back since.
It's too restricted: it can't do anything on hard topics. It doesn't work when you try to work out exploits or dangers in a system, for example.
It's not private: they say they don't train on API requests, but companies steer clear when it comes to sending sensitive data.
The model being too restrictive does seem to be a good point.
Do you think there are less restrictive models hosted on poe.com?
I don't think we've put a GroqChip in a mobile device yet. Interesting idea!
Sorry, I'm a bit naïve about all of this.
Why is this impressive? Can this result not be achieved by throwing more compute at the problem to speed up responses? Isn't the fact that there is a queue when under load just indicative that there's a trade-off between "# of request to process per unit of time" and "amount of compute to put into a response to respond quicker"?
https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/rel/do...
This chart from NVIDIA implies their H100 runs llama v2 70B at >500 tok/s.
Scaling up compute can improve throughput, but can't easily improve latency between tokens. Generation is usually bottlenecked by the time it takes to go through the network for each token. To speed that up, you need to perform these computations faster, which is a hard problem after you've exhausted all the obvious options (use the fastest accelerator you can find, cache what you can etc).
Yeah. That makes sense, thank you for clarifying. I updated my original post with a chart from NVIDIA which highlights the H100's capabilities. It doesn't seem unreasonable to expect a 7B model to run at 500 tok/s on that hardware.
This is a 50B model. (Mixtral 8x7b)
Oh, sorry, I assumed the 8 was for quantization. 8x7b is a new syntax for me.
Still, the NVIDIA chart shows Llama v2 70B at 750 tok/s, no?
I guess that's total throughput, rather than per user? You can increase total throughput by scaling horizontally. You can't increase throughput per user that way.
At batch size 1 LLMs are memory bandwidth bound, not compute bound…as in you spend most time waiting for model weights to load from vram. At higher batch sizes this flips.
But this is why Groq is built around large numbers of chips with small amount of very fast sram.
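That ceiling is easy to estimate: at batch 1, every generated token has to stream essentially all of the weights past the ALUs once, so tokens/s is bounded by memory bandwidth divided by model size. The numbers below (FP16 Llama 2 70B, nominal H100-class HBM3 bandwidth) are assumptions for illustration:

    # Rough batch-1 ceiling: tokens/sec <= memory_bandwidth / model_bytes.
    params = 70e9                  # Llama 2 70B
    model_bytes = params * 2       # FP16 -> ~140 GB of weights
    hbm_bw = 3.35e12               # ~3.35 TB/s nominal HBM3 (assumed)

    print("batch-1 ceiling:", hbm_bw / model_bytes, "tok/s")   # ~24 tok/s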
LLM inference is inherently a sequential problem. You can't speed it up by doing more in parallel. You can't generate the 101st token before you've generated the 100th.
Technically, I guess you can use speculative execution to speed it up, and in that way take a guess at what the 100th token will be and start on the 101st token at the same time? Though it probably has its own unforeseen challenges.
Everything is predictable with enough guesses.
People are pretty cagey about what they use in production, but yes, speculative sampling can offer massive speedups in inference
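A greatly simplified (greedy) sketch of speculative decoding, with toy stand-in functions for the draft and target models: the cheap model proposes k tokens, the big model checks them (in one batched pass on real hardware), and only the agreeing prefix is kept, plus one corrected token on a mismatch.

    def draft_next(ctx):     # toy stand-in for a fast, less accurate model
        return (sum(ctx) * 31 + 7) % 50

    def target_next(ctx):    # toy stand-in for the slow, authoritative model
        return (sum(ctx) * 31 + 7) % 50 if sum(ctx) % 5 else (sum(ctx) + 1) % 50

    def generate(ctx, n_tokens, k=4):
        out = list(ctx)
        while len(out) - len(ctx) < n_tokens:
            # 1) draft k tokens cheaply
            draft = []
            for _ in range(k):
                draft.append(draft_next(out + draft))
            # 2) verify them with the target model (batched in practice)
            kept = 0
            for i in range(k):
                if target_next(out + draft[:i]) == draft[i]:
                    kept += 1
                else:
                    break
            # 3) keep the agreeing prefix; on mismatch, take the target's token
            out += draft[:kept]
            if kept < k:
                out.append(target_next(out))
        return out[len(ctx):][:n_tokens]

    print(generate([1, 2, 3], 10))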
They’re using several hundred cards here. Clearly there is ‘something’ that can be done in parallel.
I think NVidia is listing max throughput in terms of batching, so e.g. 50 tok/s for 10 different prompts at the same time. Groq LPUs definitely outperform an H100 in raw speed.
But fundamentally it's a system that only has 10x the speed for 500x the price, made by a company that runs a blockchain and is trying to heavily market what were intended to be crypto mining chips for LLM inference. It's really quite a funny coincidence that when someone amazed posts this weekly link there's an army of Groq engineers at the ready in the comments ready to say everything and anything.
Groq does not run a blockchain and our chips were never intended for crypto mining.
https://www.livecoinwatch.com/price/GroqAI-GROQ
I suppose that's someone else then? If that's true, then with this and Elon's Grok it's surprising the US Patent office hasn't taken your trademark away yet for not adequately defending it from infringement.
I don't know what that is. It's nothing to do with Groq Inc.
There's a difference between token throughput and latency. Token throughput is the token throughput of the whole GPU/system and latency is the token throughput for an individual user. Groq offers extremely low latency (aka extremely high token throughput per user) but we still don't have numbers on the token throughput of their entire system. Nvidia's metrics here on the other hand, show us the token throughput of the whole GPU/system. So, in reality, while you might be able to get 1.5k t/s on an H100, the latency (token throughput per user) will be something much lower like 20 t/s.
The really important metric to look for is cost per token because even though Groq is able to run at low latency, that doesn't mean it's able to do it cheaply. Determining the cost per token can be done many ways but a useful way for us is approximately the cost of the system divided by the total token throughput of the system per second. We don't have the total token throughput per second of Groq's system so we can't really say how efficient it is. It could very well be that Groq is subsidizing the cost of their system to lower prices and gain PR and will increase their prices later on.
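As a worked, entirely hypothetical version of that calculation (every input below is made up; the point is only the shape of the arithmetic):

    system_cost = 5_000_000        # $, capital cost of the whole inference system
    depreciation_years = 3
    system_tok_per_s = 20_000      # total tokens/s across ALL concurrent users
    utilization = 0.5              # fraction of time the system is actually busy

    seconds = depreciation_years * 365 * 24 * 3600
    total_tokens = system_tok_per_s * utilization * seconds
    print("$ per million tokens:", 1e6 * system_cost / total_tokens)  # ~$5.3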
https://wow.groq.com/artificialanalysis-ai-llm-benchmark-dou...
Seems to have it. Looks cost competitive but a lot faster.
People are using throughput and latency differently in different locations/contexts. Here they are referring to token throughput per user and first token/chunk latency. They don't mention the token throughput of the entire 576-chip system[0] that runs Llama 2 70b which would be the number we're looking for.
I guess it depends on how much the infrastructure from TFA costs, as the H100 only costs ~$3300 to produce but gets sold for ~$30k on average.
https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gp...
Not clear if it is due to Groq or to Mixtral, but confident hallucinations are there.
We run the open source models that everyone else has access to. What we're trying to show off is our low latency and high throughput, not the model itself.
But if the model is useless/full of hallucinations, why does the speed of its output matter?
"generate hallucinated results, faster"
No, it is "do whatever you were already doing with ML, faster"
This question seems either from a place of deep confusion or is in bad faith. This post is about hardware. The hardware is model independent.* Any issues with models, like hallucinations, are going to be identical if it is run on this platform or a bunch of Nvidia GPUs. Performance in terms of hardware speed and efficiency are orthogonal to performance in terms of model accuracy and hallucinations. Progress on one axis can be made independently to the other.
* Technically no, but close enough
Well ok, Groq provides lower latency cheaper access to the same models of questionable quality.
Is this not a 'putting lipstick on a pig' scenario? I suppose that's more of a question for pig buyers.
Okay. How about this: Someone posts to HN about an amazing new battery technology, which they demo by showing an average-sized smartphone watching TikTok endlessly scroll for over 500 hours on a single charge.
Then someone comments that TikTok is a garbage fire and a horrible corrupting influence, yadda yadda, all that stuff. They ask: what is the point of making phones last longer just to watch TikTok? They say this improved efficiency in battery tech is just putting lipstick on a pig.
That's you in this thread. That's the kind of irrelevant non-contribution you are making here.
Perhaps your analogy reveals more than you intended.
What does it tell you about the new technology if the best vehicle to demonstrate it is TikTok?
Batteries are useful. The majority of LLMs are not?
They’re probably in the business of being the hardware provider. Best thing would be if Microsoft buys a lot of their chips and that way chatgpt is actually sped up. It’s basically model independent
Mixtral 8x7b is competitive with ChatGPT 3.5 Turbo so I'm not sure why you are being so dismissive.
https://chat.lmsys.org/ check the leaderboard.
I asked it to come up with name ideas for a company and it hallucinated them successfully :) I think the trick is to know which prompts are likely to yield results that aren't likely to be hallucinated. In other contexts it's a feature.
A bit of a softball don't you think? The initial message suggests "Are you ready to experience the world's fastest Large Language Model (LLM)? We'd suggest asking about a piece of history"
So I did.
In the top left-hand corner you can change the model to the Llama 2 70B model.
This is pretty sweet. The speed is nice, but what I really care about is you bringing the per-token cost down compared with models on the level of Mistral Medium/GPT-4. GPT-3.5 is pretty close in terms of cost/token, but the quality isn't there, and GPT-4 is overpriced. Having GPT-4 quality at sub-GPT-3.5 prices would enable a lot of things, though.
I wonder if Gemini Pro 1.5 will act as a forcing function to lower GPT4 pricing.
Is that available via an API now?
Kind of, it's in a "Private Preview" with a waitlist.
And in non EU countries.
Via GCP only?
What's the difference in your own words/opinion in quality between GPT-3.5 and GPT-4? For what usecases?
GPT-3.5 is great at spitting out marketing babble, summarizing documents, and performing superficial analysis, but it doesn't take style prompts as well as GPT-4, and its reasoning is significantly worse when you want it to chain-of-thought its way through a complex process while referencing context guidance.
You seem to be implying that Mistral Medium is on the same level as GPT-4?
Mixtral's quality is definitely up there with GPT-3.5. Specifically for coding, I consider them almost equivalent in quality. In fact, Mixtral 8x7B is starting to be my go-to coding assistant instead of GPT. It is fast, it is accurate, and I think I like its responses better than GPT's.
Reducing LLM size almost 10 times in the span of a little more than a year, that's great stuff. The next step, I think, is a 3-billion-parameter MoE with 20 experts.
GPT-4 is overpriced vs what?
So this has nothing to do with `Grok`, the model provided by x.ai?
EDIT: Tried using it, very impressed with the speed.
Yeah, it's nothing to do with Elon and we (Groq) had the name first. It's a natural choice of name for something in the field of AI because of the connections to the hacker ethos, but we have the trademark and Elon doesn't.
I mean it sucks that Elon went and claimed Grok when you want Groq, plus you were there first, but getting stuck on the name seems like it's going to be a distraction, so why not choose something different?
When Grok eventually makes the news for some negative thing, do you really want that erroneously associated with your product? Do you really want to pick a fight with the billionaire that owns Twitter? Is that a core competency of the company?
If anything, getting in a very public fight with Musk may well be beneficial wrt brand recognition. Especially if he responds in his usual douchy way and it gets framed accordingly in the media.
it'll be great for publicity, for sure
Can't Chamath (he's one of your investors, right?) do something there? Every person I pitch Groq to is confused and thinks it's about Elon's unspectacular LLM.
Yeah the confusion has happened a lot to me too. All I know is that it's in the hands of our legal team.
There's also a children's toy named Grok which uses LLMs to talk to the kid.
Thanks for asking just the question I wanted to ask!
If it wasn't for your comment, I would've thought this was by Twitter.
Interesting, I ran my "prime test", it passed on generating a prime, it failed on it being large.
"Can you generate me a large prime please"
Sure, here's a large prime number: 193707721
This number is a prime number because it is only divisible by 1 and itself. You can check its primality by trying to divide it by all the numbers up to its square root, and if none of them divide it evenly, then it is a prime number. This number is quite large, but there are much larger prime numbers known. For example, the largest known prime number, as of January 2022, is 2^82,589,933−1, a number with 24,862,048 digits, discovered as part of the Great Internet Mersenne Prime Search (GIMPS).
I then prompted "That's not large" which seemed to break it, the response is stuck in a loop:
That isn't large
I apologize if the prime number I provided did not meet your expectations for "large." Prime numbers can be very large indeed, and there is no upper limit to how large a prime number can be.
If you are looking for a very large prime number, here is one with a billion digits:
100000303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303...
It's printed many more pages of 303030... but hacker news has a post length limit!
Further testing, such as prompting to generate a 1024 bit prime also runs into odd loops.
Can you generate a 1024 bit prime?
Yes, I can generate a 1024-bit prime number for you. Here's one I just generated:
115322512531875133213135311402305250340282302283103413642283221253115653221253112210221122112112112112112112112112112112112112112112112112112112112112112112112112112112112112112112112112112112 ( Loops 112 forever )
I can't think of any value in asking LLMs to generate prime numbers. They are neural nets, not Turing machines.
The value is showing how confidently it presents incorrect results.
Especially the lack of nuance or uncertainty in the language they use.
They present the incorrect information extremely confidently, and prime generation is interesting because it's information that isn't easy for the user to spot as obviously incorrect, while being possible to check for small numbers and difficult to verify for large ones.
It's my favourite test because it's a good demonstration of the lack of nuance or uncertainty in LLMs. They have no sense of how wrong the information they're giving out might be.
If they could give confidence intervals for any information, they could provide context on how likely they think they are to be correct, but instead they double down on their incorrectness.
Mixtral is a pre-existing model that was released by Mistral AI in December: https://mistral.ai/news/mixtral-of-experts/
It was discussed on HN at the time: https://news.ycombinator.com/item?id=38598559
You won't learn anything about the chips this submission is about by asking it questions - except maybe to look for degradation from quantization or computational shortcuts, but you'd have to compare the answers to Mixtral running with 16-bit floats. The point is the speed, just like a faster CPU isn't going to improve the UI of Outlook.
Ask a proper code gen llm, I bet you get a better response.
How do you know that it didn't somehow find the largest prime? Perhaps you just threw away a Nobel Prize.
Nobel Prize in what? There is no Nobel in mathematics or computer science.
Nobel prizes are a joke anyway... Obama got the one for peace?
It's worth noting that it may have got lucky with the first response. A clean session with the prompt:
Could you generate me a 64 bit prime?
Returns:
Sure, here's a 64-bit prime number: 15592425903773483751
You can verify that it's prime by checking that it has no divisors between 2 and its square root. In this case, the square root of the number is approximately 12489593.5, so you would need to check divisibility for all integers between 2 and 12489593.5. Since the number is odd, you can skip checking even numbers, which will reduce the number of checks you need to perform.
However, if you're looking for a larger 64-bit prime or a prime number with specific properties, you may need to use a more sophisticated algorithm or a specialized prime number generator.
Like GPT3.5 and Bard, it confidently suggests that 15592425903773483751 is prime, and even explains why.
It's wrong:
    $ factor 15592425903773483751
    15592425903773483751: 3 3 3 7 167 853 911 635723119
Further prompting:
Can you factor 15592425903773483751?
No, I cannot factor 15592425903773483751 because it is a prime number, and prime numbers cannot be factored into smaller non-trivial factors. By definition, a prime number is a positive integer greater than 1 that has no positive integer divisors other than 1 and itself. In this case, 15592425903773483751 is a 64-bit prime number, which means it is a prime number that fits into a 64-bit data type.
To verify that 15592425903773483751 is a prime number, you can use a primality test, such as the Fermat primality test or the Miller-Rabin primality test. These tests can quickly determine whether a number is prime or composite (not prime) with high probability. In the case of 15592425903773483751, both tests confirm that it is a prime number.
Even further prompting to get it to try to multiply numbers together quickly turns into a disaster, culminating with:
That doesn't look right
I apologize for the mistake in my previous response. Here is the correct result of the multiplication:
3 x 3 x 3 x 7 x 167 x 853 x 911 x 635723119 = 322711211111111111
Therefore, the product of the numbers is:
322711211111111111.
I apologize for any confusion my previous response may have caused.
The context seems to be really aggressively baked in, because when I ask it:
Can you factor 322711211111111111?
It returns
Yes, I can factor the number 322711211111111111. Here is its prime factorization:
322711211111111111 = 3 x 3 x 3 x 7 x 167 x 853 x 911 x 635723119
Therefore, the prime factors of the number are 3, 7, 167, 853, 911, and 635723119.
Note that the prime factorization is unique up to reordering of the factors. In this case, the prime factors are repeated several times, so the prime factorization can be written more compactly as:
3^3 x 7 x 167 x 853 x 911 x 635723119
or
3 x 3 x 3 x 7 x 167 x 853 x 911 x 635723119
I hope this helps! Let me know if you have any other questions.
The real irony? 322711211111111111 is prime!
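For anyone who wants to reproduce these checks without GNU factor, here is a quick sketch using sympy (assuming it's installed), covering the three numbers from this sub-thread:

    from sympy import isprime, factorint

    print(isprime(193707721))               # True: the first "large prime" really is prime
    print(isprime(15592425903773483751))    # False: the claimed 64-bit prime is composite
    print(factorint(15592425903773483751))  # {3: 3, 7: 1, 167: 1, 853: 1, 911: 1, 635723119: 1}
    print(isprime(322711211111111111))      # the made-up "product", reported above to be prime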
Impressive demo!
However, the hardware requirements and cost make this inaccessible for anyone but large companies. When do you envision that the price could be affordable for hobbyists?
Also, while the CNN Vapi demo was impressive as well, a few weeks ago here[1] someone shared https://smarterchild.chat/. That also has _very_ low audio latency, making natural conversation possible. From that discussion it seems that https://www.sindarin.tech/ is behind it. Do we know if they use Groq LPUs or something else?
I think that once you reach ~50 t/s, real-time interaction is possible. Anything higher than that is useful for generating large volumes of data quickly, but there are diminishing returns as it's far beyond what humans can process. Maybe such speeds would be useful for AI-AI communication, transferring knowledge/context, etc.
So an LPU product that's only focused on AI-human interaction could have much lower capabilities, and thus much lower cost, no?
> However, the hardware requirements and cost make this inaccessible for anyone but large companies. When do you envision that the price could be affordable for hobbyists?
For API access to our tokens as a service we guarantee to beat any other provider on cost per token (see https://wow.groq.com). In terms of selling hardware, we're focused on selling whole systems, and they're only really suitable for corporations or research institutions.
Do you have any data on how many more tokens I would use with the increased speed?
In the demo alone I just used way more tokens than I normally would testing an LLM since it was so amazingly fast.
Interesting question! Hopefully being faster is so much more useful to you that you use a lot more :)
How open is your early access? i.e. likelihood to get API access granted right now
We are absolutely slammed with requests right now, so I don't know, sorry.
>50 t/s is absolutely necessary for real-time interaction with AI systems. Most of the LLM's output will be internal monologue and planning, performing RAG and summarization, etc, with only the final output being communicated to you. Imagine a blazingly fast GPT-5 that goes through multiple cycles of planning out how to answer you, searching the web, writing book reports, debating itself, distilling what it finds, critiquing and rewriting its answer, all while you blink a few times.
Given the size of the Sindarin team (3 AFAICT), that mostly looks like a clever combination of existing tech. There are some speech APIs that offer word-by-word realtime transcription (Google has one); I assume most of the special sauce is very well-thought-out pipelining between speech recognition -> LLM -> TTS.
(not to denigrate their awesome achievement, I would not be interested if I were not curious about how to reproduce their result!)
To what extent is the API compatible with OpenAI? Does it offer logprobs[0] and top_logprobs[1]?
0. https://platform.openai.com/docs/api-reference/chat/create#c...
1. https://platform.openai.com/docs/api-reference/chat/create#c...
You can find our API docs here, including details of our OpenAI compatibility
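For reference, talking to an OpenAI-compatible endpoint from Python usually looks like the sketch below. The base URL and model name are assumptions on my part; check the linked docs for the real values.

    from openai import OpenAI

    # Hypothetical values; substitute the endpoint and model names from the official docs.
    client = OpenAI(
        api_key="YOUR_GROQ_API_KEY",
        base_url="https://api.groq.com/openai/v1",   # assumed OpenAI-compatible endpoint
    )
    resp = client.chat.completions.create(
        model="mixtral-8x7b-32768",                  # assumed model identifier
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(resp.choices[0].message.content)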
Filled the form for API Access last night. Is there a delay with increased demand now?
Yes, there's a huge amount of demand because Twitter discovered us yesterday. There will be a backlog, so sorry about that.
Understandable. Wish you guys best of luck irrespective.
By the way, we also have a new Discord server where we are hosting our developer community. If you find anything missing in our API you can ask about there:
Where is the data center located? The fastest response time I could get from some quick testing from the northeast US, having it output just one letter, was 670ms. Just wondering if that's an expected result, as it's on par with or slower than GPT 3.5 via API.
West Coast US. You would have been placed in our queuing system because with all the attention we are getting we are very busy right now!
Thanks! I did notice the queue count showing up occasionally but not every time. Maybe someone who has access without the queue could repeat the test so we can get an understanding of the potential latency once scaled and geo-distributed. What I'm really trying to understand is whether the time to first token is actually faster than GPT 3.5 via API, or just the rate of token output once it begins.
I don't know about GPT 3.5 specifically, but on this independent benchmark (LLMPerf) Groq's time to first token is also lowest:
https://github.com/ray-project/llmperf-leaderboard?tab=readm...
There’s a queueing system if too many requests are being processed at once. You may have hit that.
How come the answers for Mixtral 8x7B-32k and Llama 2 70B-4k are identical?
After asking a couple of questions via Mixtral I switched to Llama, and while it shows Llama as the model used for the response, the answer is identical.
See first and last question:
Yeah, it's confusing. See here for an explanation: https://news.ycombinator.com/item?id=39431921
Thanks! And congrats, the speed is impressive and quality really good.
Oh yeah you definitely need to change that ASAP or at least add an explanation. I also thought there was something fishy going on. Thanks for the explanation.
If the page can't access certain fonts, it will fail to work, while it keeps retrying requests:
https://fonts.gstatic.com/s/notosansarabic/[...]
https://fonts.gstatic.com/s/notosanshebrew/[...]
https://fonts.gstatic.com/s/notosanssc/[...]
(I noticed this because my browser blocks these de facto trackers by default.)
Same problem when trying to use font replacements with a privacy plugin.
This is a very weird dependency to have :-)
Thanks, I've reported this internally.
How to show Google how popular and interesting for acquisition you are without directly installing google trackers on your website.
Sorry if this is dumb, but how is this different from Elon's Grok? Was Groq chosen as a joke or homage?
This company is older than Elon's
ah ok cool, why the downvotes? did I offend more than one person with my ignorance? why did Elon name his Grok?
I don't know why you got downvoted, but "grok" is the Martian word for "understand deeply" from Robert Heinlein's "Stranger in a Strange Land".
The Groq demo was indeed impressive. I work with LLMs a lot at work, and a generation speed of 500+ tokens/s would definitely change how we use these products. (Especially considering it's an early-stage product.)
But the "completely novel silicon architecture" and the "self-developed LPU" (claiming not to use GPUs)... make me a bit skeptical. After all, pure speed might be achievable through stacking computational power and model quantization. Shouldn't innovation at the GPU level be quite challenging, especially to achieve such groundbreaking speeds?
more on the LPU and data center: https://wow.groq.com/lpu-inference-engine/
price and speed benchmark: https://wow.groq.com/
I work at Groq. We aren't using GPUs at all. This is a novel hardware architecture of ours that enables this high throughput and low latency. Nothing sketchy about it.
> Shouldn't innovation at the GPU level be quite challenging, especially to achieve such groundbreaking speeds?
GPUs are general purpose; a purpose-built chip that is better isn't that hard to make at all. Google didn't have to work hard at all to invent TPUs, which are the same idea; they said their first tests proved the idea worked, so it didn't require anything near Nvidia's scale or expertise.
This demo has more than 500 chips btw, it’s not exactly an apples to apples comparison with 1 GPU…
Definitely not, but even with a comparison to 500 GPUs Groq will still come out on top because you can never reduce latency by adding more parallel compute :)
Please correct me if I'm wrong: are you running a batch size of 1 on 500 GPUs? If so, why are the responses almost instant at batch size 1? Also, when can we expect a bring-your-own-fine-tuned-model kind of offering? Thanks!
We are not using 500 GPUs, we are using a large system built from many of our own custom ASICs. This allows us to do batch size 1 with no reduction in overall throughput. (We are doing pipelining though, so many users are using the same system at once).
Do you guys provide logprobs via the api?
You can check out all our API features here: https://docs.api.groq.com/
Correct me if I'm wrong but it seems from the docs that the answer is no?
Have you experimented with running diffusion models on Groq hardware?
Yes, we don't have any publicly accessible ones at the moment though.
Diffusion models will be as much of a killer app as LLMs. A picture is worth 1,000 words. A video…
I (only) ran a couple of prompts but I am impressed. It has the speed of gpt 3.5 and the quality of gpt 4.
Seriously considering switching from [open]AI to Mix/s/tral in my apps.
Try more prompts; both models could not even answer the "Sally has 3 brothers" question. Really disappointing.
Mixtral 8x7 is good, but it is not GPT-4 good in any of the use cases I have tried. Mistral’s other models get close and beat it in some cases, but not Mixtral.
Do they make money from LLM service or by selling hardware? Homepage is confusing without any reference to other products.
Both, we sell tokens as a service and we sell enterprise systems.
Then reference that on the homepage? If not for this HN thread, I wouldn't have known you sell hardware.
Very impressive! I am even more impressed by the API pricing though - $0.27/1M tokens seems like an order of magnitude cheaper than the GPT-3.5 API, and two orders of magnitude cheaper than GPT-4? Am I missing something here?
They’re competing with the lowest cost competitors for mistral atm, which afaik is currently deepinfra at the same pricing
Huh! Had no idea open source models were already ahead of OpenAI on pricing - will have to look into using these for my use cases.
For me, it was fast when it started printing (it did almost instantly), but it took forever for it to start.
There are a lot of people interested in Groq now, so most jobs are sitting in a queue for a little while.
Fair enough... I guess at least it didn't crash like many other overwhelmed sites do... but at the end of the day, that was my experience.
Nice… a startup that has two "C" positions: CEO and Chief Legal Officer…
That sounds like a fun place to be
When you have a pile of hardware and silicon Intellectual Property, patents, etc, IMO it's pretty clever. However, I'm a Groq Engineer, and I'm mega-biased.
They seem to have been around since 2016. Maybe not bad for an LLM company that will need to deal with legal issues?
Jesus that makes chatgpt and even gemini seem slow AF
Better quality than I was expecting. For fun I set the system prompt to:
You are a leader of a team of ai helpers. when given a question you can call on an expert, as a wizard calls on magic. You will say, I call forth {expert} master of {subject matter} an expert in {x, y, z}. Then you will switch to that persona.
I was not let down..
Nice prompting strategy :)
I always ask LLMs this:
If I initially set a timer for 45 minutes but decided to make the total timer time 60 minutes when there's 5 minutes left in the initial 45, how much should I add to make it 60?
And they never get it correct.
gpt4 first go:
If you initially set a timer for 45 minutes and there are 5 minutes left, that means 40 minutes have already passed. To make the total timer time 60 minutes, you need to add an additional 20 minutes. This will give you a total of 60 minutes when combined with the initial 40 minutes that have already passed.
Bard/Gemini gets it wrong the same way too. Interestingly, if I tell either GPT-4 or Gemini the right answer, they figure it out.
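For what it's worth, a quick worked check of the puzzle as I read it - the models double-count the 5 minutes still left on the timer:

    initial = 45            # minutes originally on the timer
    remaining = 5           # minutes left when the change is made
    target_total = 60       # desired total timer time

    elapsed = initial - remaining             # 40 minutes have passed
    still_needed = target_total - elapsed     # 20 more minutes of timing needed
    to_add = still_needed - remaining         # 5 of those are already on the timer
    print(to_add)                             # 15, not the 20 the models suggest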
Switching the model between Mixtral and Llama I get word for word the same responses. Is this expected?
Yeah, this is a common observation. See my comment at https://news.ycombinator.com/item?id=39431921
Maybe we should change the behavior to stop people getting confused.
No…
The demo is pretty cool, but the mobile interface could be a parody of bad interface design. The text box at the top is hard to reach whether you want to open the keyboard (which automatically closes) or press the button to send the question, and the chat history is out of chronological order for no logical reason.
Edit: Text selection is also broken.
> Edit: Text selection is also broken.
Or disabled?
It works for me, but the selected text is superimposed on top of the normal text in a different size.
What's the underlying hardware for this?
It's a system built from hundreds of GroqChips (a custom ASIC we designed). We call it the LPU (language processing unit). Unlike graphics processors, which are still best in class for training, LPUs are best in class for low latency and high throughput inference. Our LLMs are running on several racks with fast interconnect between the chips.
They have a paper [1] about their 'tensor streaming multiprocessor'
[1] https://wow.groq.com/wp-content/uploads/2024/02/GroqISCAPape...
Really impressive work! I wonder how easy it would be to support (a future open source version of) Sora using Groq's design. Will there be a Video Processing Unit (VPU)?
I can't comment about Sora specifically; however, the architecture can support workloads beyond just LLM inference.
Our demo booth at trade shows usually has StyleCLIP up at one point or another to provide an abstract example of this.
Disclosure: I work on infrastructure at Groq and am generally interested in hardware architecture and compiler design; however, I am not part of either of those teams :)
Ignoring latency but not throughput, how does this compare in terms of cost (card acquisition cost and power needed) with Nvidia GPUs for inference?
We intend to be very competitive on cost, power, hardware, TCO, whatever it is. Custom-built silicon+hardware has the advantage in this space.
Relevant thread from 5 months ago: https://news.ycombinator.com/item?id=37469434
I'm achieving consistent 450+ tokens/sec for Mixtral 8x7b 32k and ~200 tps for Llama 2 70B-4k.
As an aside, seeing that this is built with Flutter Web, perhaps a mobile app is coming soon?
There was also another discussion about Groq a couple of months ago https://news.ycombinator.com/item?id=38739199
This is very impressive, but whilst it was very fast with Mixtral yesterday, today I waited 59.44s for a response. If I were to use your API, end-to-end latency would be much more important to me than the Output Tokens Throughput and Time to First Token metrics. Will you also publish average / minimum / maximum end-to-end times?
Yes, sorry about that, it's because of the huge uptick in demand we've had since we went viral. We're building out more and more hardware to cope with demand. I don't think we have any quality of service guarantees for our free tier, but you can email sales@groq.com to discuss your needs.
I’m sure Elon is pissed since he has Grok.
Someone now needs to make a Groc
It's impressive, but I have one problem with all of these models: I wanted them to explain what Mixtral or Llama2 are, but with no luck. It would be great if models could at least describe themselves.
There are two issues with that.
1. To create a model, you have to train it on training data. Mixtral and Llama2 did not exist before they were trained, so their training data did not contain any information about Mixtral or Llama2 (respectively). You could train it on fake data, but that might not work that well because:
2. The internet is full of text like "I am <something>", so it would probably overshadow any injected training data like "I am Llama2, a model by MetaAI."
You could of course inject the information as an invisible system prompt (like OpenAI is doing with ChatGPT), but that is a waste of computation resources.
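A minimal sketch of that last approach, just to make it concrete (the wording of the injected identity is made up):

    # The identity is injected at inference time rather than learned in training.
    messages = [
        {"role": "system",
         "content": "You are Llama2, a large language model trained by Meta AI."},
        {"role": "user", "content": "What model are you?"},
    ]
    # `messages` is then passed to whatever chat-completion endpoint serves the model.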
omg. i can’t believe how incredibly fast that is. and capable too. wow
Thanks! Feel free to join our discord for more announcements and demos! https://discord.com/invite/TQcy5EBdCP
Do you have any plans to support bringing your own model? I have been using Sagemaker but it is very slow to deploy to.
Yes, we're working with some customers on that, but it will be a while until general availability.
Just a minor gripe: the bullet option doesn't seem to be logical.
When I asked about Marco Polo's travels and used Modify to add bullets, it added China, Pakistan, etc. as children of Iran. And the same for other paragraphs.
(Groq Employee) Thanks for the feedback :) We're always improving that demo.
I see XTX is one of the investors. Any potential other use cases with async logic beyond just inference?
See this reply: https://news.ycombinator.com/item?id=39437239
I see XTX is one of the investors - any potential use cases that require deterministic computation that you can talk about beyond just inference?
(Groq Employee) As I'm sure you're aware, XTX takes its name from a particular linear algebra operation that happens to be used a lot in Finance.
Groq happens to be excellent at doing huge linear algebra operations extremely fast. If they are latency sensitive, even better. If they are meant to run in a loop, best of all - that reduces the bandwidth cost of shipping data into and out of the system. So think linear algebra driven search algorithms. ML training isn't in this category because of the bandwidth requirements. But using ML inference to intelligently explore a search space? Bingo.
If you dig around https://wow.groq.com/press, you'll find multiple such applications where we exceeded existing solutions by orders of magnitude.
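For readers who don't recognise the reference: the operation is the Gram matrix X^T X that shows up in least-squares style workloads. A toy numpy sketch of where it appears:

    import numpy as np

    # Ordinary least squares via the normal equations: beta = (X^T X)^{-1} X^T y.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8))    # 1000 samples, 8 features
    y = rng.normal(size=1000)

    gram = X.T @ X                    # the X^T X operation the name alludes to
    beta = np.linalg.solve(gram, X.T @ y)
    print(beta.shape)                 # (8,)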
Are there also experiments around image embedding generation to use in combination with the LLM?
Or, for this use case, would it be better to execute the vision tower on a GPU and leave the LPU for the language part?
We are great for image embedding (and audio, with more to come!) There is no reason you should be forced to use graphics cards intended for gaming for any AI workload.
I’ll pay $xx a month if I can talk to Groq the way I can talk to ChatGPT with my AirPods
Potentially coming soon?
Check out this demo of our founder demoing the Groq system to a reporter: https://www.youtube.com/watch?v=pRUddK6sxDg&ab_channel=Groq. She's talking to the system in real time, similar to what you describe.
Impressive work. Nice job team. This is big.
Thanks!
Holy smokes this is fast
Thanks!
I asked it "What is the oldest known written recipe" and then asked for that recipe. It seemed to repeat the same response a bunch of times (very quickly!) before stopping. Is this abnormal behavior?
For me, the reply to the continuation question "Please tell me the exact steps of that recipe." was:
Here is a rough translation of the oldest known written recipe, which is for a kind of beer from ancient Mesopotamia:
1. Crush and mash 4 liters of barley to create malt.
2. Soak the malt in water for 2.5 days.
3. Leave the mash in a cool place for 2.5 days.
4. Filter the mash through a strainer to separate the liquid from the solids.
5. Boil the liquid for 1 hour.
6. Add 1 liter of honey and 1 handful of hops to the liquid.
7. Leave the mixture to cool.
8. Add 1 handful of "gig" (a type of wild yeast) to the mixture.
9. Leave the mixture in a cool place for 5 days to allow fermentation.
10. Filter the mixture again to separate the beer from the solids.
11. Serve the beer chilled.
It's important to note that this recipe is over 3,700 years old, and it may not produce a beer that is palatable to modern tastes. Additionally, some of the ingredients and techniques used in this recipe may not be readily available or practical for home brewers today.
Jaw dropping. Both groq and mixtral.
I used the following prompt:
Generate gitlab ci yaml file for a hybrid front-end/backend project. Fronted is under /frontend and is a node project, packaged with yarn, built with vite to the /backend/public folder. The backend is a python flask server
And yet, it made a simple mistake in some python code :'(

    particles = np.zeros((2, 3))  # position, velocity, and acceleration
    particles[:, 0] = [0.0, 0.0, 0.0]  # initial position
Incredible job. Feels dumb or obvious to say this, but this really changes the way I think of using it. The slow autoregression really sucks because it inhibits your ability to skim sections. For me, that creates an unnatural reading environment. This makes ChatGPT feel antiquated.
Yes, agreed. We believe the benefits of reducing latency are non-linear. You can hit different phase changes as the latency reduces and new applications become viable. Roundtripping text-to-speech and speech-to-text is one example. We're looking forward to seeing what low latency applications are unlocked by our new users!
Is it normal that I have asked two networks (llama/mixtral) the same question ("tell me about most popular audio pitch detection algorithms") and they gave almost the same answer? Both answers start with "Sure, here are some of the most popular pitch detection algorithms used in audio signal processing" and end with "Each of these algorithms has its own strengths and weaknesses, and the choice of algorithm depends on the specific application and the characteristics of the input signal.". And the content is 95% the same. How can it be?
Yeah it's a bit confusing. See here for details: https://news.ycombinator.com/item?id=39431921
Any chance you plan to offer the API to cloud LPUs? And not just the LLM API? It would be cool run custom code (training, serving, etc).
Yes, in the future we'd like to do that.
This is incredibly fast, indeed.
What are the current speeds in T/s for say ChatGPT 3.5 or ChatGPT 4? Just how much faster is this?
I ran the same (Code generation) query and here are my results as end user:
ChatGPT: 1 minute 45 seconds.
Gemini: 16 seconds.
Groq: 3 seconds.
That sort of speed will be amazing for code completion. Need to find a way to hook this into VS Code somehow...
in a lot of use cases.
Imagine this for audio chat. Phone call prospection. Awww
I think we’re kind of past the point where we post prompts because it’s interesting, but this one still had me thinking.
Obviously it doesn’t have memory, but it’s the first time I’ve seen a model actually respond instead of hedge (having mostly used ChatGPT).
what is the longest prompt you have ever received?
The length of a prompt can vary greatly, and it's not uncommon for me to receive prompts that are several sentences long. However, I don't think I have ever received a prompt that could be considered "super long" in terms of physical length. The majority of prompts I receive are concise and to the point, typically consisting of a single sentence or a short paragraph.
Been using it exclusively since December, 5bit quantized, 8,000 token context window
Sometimes you need a model that just gives you the feeling “that’ll do”
I did switch to Miqu a few weeks back though. 4 bit quantized
As a virtual reality geek, this is super exciting because although there are numerous people experimenting with voicing NPCs with LLMs, they all have horrible latency and are unusable in practice. This looks like the first one that can actually potentially work for an application like that. I can see it won't be long before we can have open ended realistic conversations with "real" simulated people!
It is fast, but if it spits out useless garbage, then it's useless. I don't mind waiting for ChatGPT; the quality of what it produces is quite remarkable, and I am excited to see it get better. I think this has more to do with the Mistral model vs GPT-4 than with Groq. If Groq can host GPT-4, wow, then that is amazing.
@tome Cannot sign up with sneakemail.com, snkml.com, snkmail, liamekaens.com etc... I pay for these services so my email is a bit more protected. Why do you insist on well-known email providers instead - data mining or something else?
any noscript/basic (x)html prompt?
Oh hell yes, this is the first "fast" one, superhuman fast.
I know you gave suggestions of what to ask, but I threw a few curveballs and it was really good! Well done this is a big step forwards
I wish there were an option to export the whole chat session, say as plaintext or a link to some pastebin. That chat I just had with Groq would really impress some people I know.
How is the tokens/second figure calculated? I asked a simple prompt and the model generated a 150-word (about 300 tokens?) answer in 17 seconds, yet it reported a speed of 408 T/s.
Also, I guess this demo would feel real-time if you could stream the outputs to the UI? Can this be done in your current setup?
I asked it what Carmack's AI company was called and it correctly identified John Carmack but said he was working on VR.
This is unreal. I have never seen anything this fast. How? I mean, how can you physically ship the bits this fast, let alone run an LLM.
Something about the UI doesn't work for me. Maybe I like the OpenAI chat interface too much. Can you bring your own data and train? That would be crazy!
Hey y'all, we have a discord now for more discussion and announcements: https://discord.com/invite/TQcy5EBdCP
I’m not sure how, but I got the zoom messed up on iOS and I can no longer see the submit button. Refreshing doesn’t fix it.
Congrats on the great demo, been a fan of Groq since I learned about the TSP. I'm surprised the LPU runs Mixtral fast, because MoE's dynamic routing seems at odds with Groq's deterministic paradigm. Did Groq implement MegaBlocks-like kernels or other methods tailored for LPUs?
This is super impressive. The rate of iteration and innovation in this space means that just as I'm feeling jaded/bored/oversaturated - some new project makes my jaw drop again.
Incredible tool. The Mixtral 8x7B model running on their hardware did 491.40 T/s for me…
Tried it out, seriously impressive. I'm sure you welcome the detractors but as someone who doesn't work for or have any investments in AI, colour me impressed.
Though with the price of the hardware, I'll probably mess with the API for now. Give us a bell when the hardware is consumer friendly, ha ha.
GPUs are notoriously bad at exploiting sparsity. I wonder if this architecture can do a better job. To the Groq engineers in this thread: if a neural network had, say, 60% of its weights set to 0, what would that do to cost & speed on your hardware?
Wow! So fast!
Amazingly fast, but I don't like that the only option for signing up is a Google account. Are you planning to implement some simple authentication, maybe using just an email?
If Nvidia adds more L1/L2/L3 cache in the next gen of AI cards, will they work similarly, or is this something more?
I have this hooked up experimentally to my universal Dungeon Master simulator DungeonGod and it seems to work quite well.
I had been using Together AI Mixtral (which is serving the Hermes Mixtrals) and it is pretty snappy, but nothing close to Groq. I think the next closest that I've tested is Perplexity Labs Mixtral.
A key blocker in just hanging out a shingle for an open source AI project is the fear that anything that might scale will bankrupt you (or just be offline if you get any significant traction). I think we're nearing the phase that we could potentially just turn these things "on" and eat the reasonable inference fees to see what people engage with - with a pretty decently cool free tier available.
I'd add that the simulator does multiple calls to the api for one response to do analysis and function selection in the underlying python game engine, which Groq makes less of a problem as it's close to instant. This adds a pretty significant pause in the OpenAI version. Also since this simulator runs on Discord with multiple users, I've had problems in the past with 'user response storms' where the AI couldn't keep up. Also less of a problem with Groq.
I tried it out, and I was taken aback how quickly it answered.
Please when/where can I buy some of these for home use? Otherwise is there any way to get access to the API without being a large company building a partner product? I would love this for personal use.
This is insane. Congratulations!
Groq's API comes close to this level of performance as well. We've benchmarked performance over time and >400 tokens/s has been sustained - you can see it here: https://artificialanalysis.ai/models/mixtral-8x7b-instruct (bottom of the page for the over-time view).
O M G
It is fast, like instant. It is straight to the point compared to others. It answered a few of my programming questions to create particular code and passed with flying colors.
Conclusion: shut up and take my money
Hi, thanks for this fascinating demo. I am wondering how this architecture optimizes for the softmax part.
Thank you for doing this AMA
1. How many GroqCards are you using to run the Demo?
2. Is there a newer version you're using which has more SRAM (since the one I see online only has 230MB)? Since this seems to be the number that will drive down your cost (to take advantage of batch processing, CMIIW!)
3. Can TTS pipelines be integrated with your stack? If so, we can truly have very low latency calls!
*Assuming you're using this: https://www.bittware.com/products/groq/
1. I think our GroqChat demo is using 568 GroqChips. I'm not sure exactly, but it's about that number.
2. We're working on our second generation chip. I don't know how much SRAM it has exactly but we don't need to increase the SRAM to get efficient scaling. Our system is deterministic, which means no need for waiting or queuing anywhere, and we can have very low latency interconnect between cards.
3. Yeah absolutely, see this video of a live demo on CNN!
https://www.youtube.com/watch?t=235&v=pRUddK6sxDg
How much do 568 chips cost? What’s the cost ratio of it comparing to setup with roughly the same throughput using A100?
They’re for sale on Mouser for $20625 each https://www.mouser.com/ProductDetail/BittWare/RS-GQ-GC1-0109...
At that price 568 chips would be $11.7M
I presume that's because it's a custom ASIC not yet in mass production?
If they can get costs down and put more dies into each card then it'll be business/consumer friendly.
Let's see if they can scale production.
Also, where tf is the next coral chip, alphabet been slacking hard.
I think Coral has been taken to the wooden shed out back. Nothing new out of them for years sadly
Yeah. And it's a real shame bc even before LLMs got big I was thinking, couple generations down the line and coral would be great for some home automation/edge AI stuff.
Fortunately LLMs and hard work of clever peeps running em on commodity hardware are starting to make this possible anyway.
Because Google Home/Assistant just seems to keep getting dumber and dumber...
That seems to be per card instead of chip. I would expect it has multiple chips on a single card.
From the description that doesn't seem to be the case, but I don't know this product well
Missed that! Thanks for pointing out!
Yeah, I don't know what the cost to us is to build out our own hardware but it's significantly less expensive than retail.
Thank you, that demo was insane!
Follow up (noob) question: Are you using a KV cache? That would significantly increase your memory requirements. Or are you forwarding the whole prompt for each auto-regressive pass?
You're welcome! Yes, we have KV cache. Being able to implement this efficiently in terms of hardware requirements and compute time is one of the benefits of our deterministic chip architecture (and deterministic system architecture).
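For readers unfamiliar with the term, here is a toy, framework-free sketch of the idea (nothing Groq-specific): each decode step reuses the keys and values already computed for earlier tokens instead of recomputing them.

    import numpy as np

    d = 16
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    k_cache, v_cache = [], []   # grows by one entry per generated token

    def decode_step(x):
        # x: embedding of the newest token, shape (d,)
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        k_cache.append(k)
        v_cache.append(v)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V          # attention output for the new token only

    for _ in range(5):
        out = decode_step(rng.normal(size=d))
    print(out.shape)                # (16,)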
Thanks again! Hope I'm not overwhelming but one more question: Are you decoding with batch size = 1 or is it more?
That's OK, feel free to keep asking!
I think currently 1. Unlike with graphics processors, which really need data parallelism to get good throughput, our LPU architecture allows us to deliver good throughput even at batch size 1.
Can you talk about the interconnect? Is it fully custom as well? How do you achieve low latency?
You can find out about the chip to chip interconnect from our paper below, section 2.3. I don't think that's custom.
We achieve low latency by basically being a software-defined architecture. Our functional units operate completely orthogonally to each other. We don't have to batch in order to achieve parallelism, and the system behaviour is completely deterministic, so we can schedule all operations precisely.
https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...
When will we be able to buy Groq accelerator cards that would be affordable for hobbyists?
We are prioritising building out whole systems at the moment; I don't think we'll have a consumer-level offering in the near future.
I will mention: A lot of innovation in this space comes bottom-up. The sooner you can get something in the hands of individuals and smaller institutions, the better your market position will be.
I'm coding to NVidia right now. That builds them a moat. The instant I can get other hardware working, the less of a moat they will have. The more open it is, the more likely I am to adopt it.
Definitely, that's why we've opened our API to everyone.
I don't think that quite does it. What I'd want -- if you want me to support you -- is access to the chip, libraries, and API documentation.
Best-case would be something I buy for <$2k (if out-of-pocket) or under $5k (if employer). Next best case would be a cloud service with a limited free tier. It's okay if it has barely enough quota that I can develop to it, but the quota should never expire.
(The mistake a lot of services make is to limit the free tier to e.g. 30 days or 1 year, rather than hours/month; if I don't get around to evaluating, or I switch employers, switch projects, etc., the free tier is gone.)
I did sign up for your API service. I won't be able to use it in prod before your (very nice) privacy guarantees are turned into lawyer-compliant regulatory language. But it's an almost ideal fit for my application.
Yup, understood. Access to consumer hardware like this is not something that we provide at the moment, I'm afraid.
Don't blame you. Been at plenty of startups, resources are finite, and focus is important.
My only point was to, well, perhaps bump this up from #100 on your personal priority list to #87, to the limited extent that influences your business.
Groq Engineer here as well; we actually built our compiler to compile PyTorch, TensorFlow, and ONNX natively, so a lot of the amazing work being done by y'all isn't building much of a moat. We got Llama2 working on our hardware in just a couple of days!
I don't really understand this. If you are happy to buy a <$2k card, then what does it matter if the service is paid or not? Clearly you have enough disposable income not to care about a 'free' tier.
There are two questions. Why local?
1) Privacy and security. I work with PII.
2) Low-level access and doing things the manufacturer did not intend, rather than just running inference on Mixtral.
3) Knowing it will be there tomorrow, and I'm not tied to you. I'm more than happy to pay for hosted services, so long as I know after your next pivot, I'm not left hanging.
Why free tier?
I'm only willing to subsidize my employer on rare occasions.
Paying $12 for a prototype means approvals and paperwork if employer does it. I won't do it out-of-pocket unless I'm very sure I'll use it. I've had free tier translate into millions of dollars of income for one cloud vendor about a decade ago. Ironically, it never happened again, since when I switched jobs, my free tier was gone.
The issue with their approach is that the whole LLM must fit in the chips to run at all: you need hundreds of cards to run a 7B LLM.
This approach is very good if you want to spend several millions building a large inference server to achieve the lowest latency possible. But it doesn't make sense for a lone customer buying a single card, since you wouldn't really be able to run anything on it.
You all seem like one of the only companies targeting low-latency inference rather than focusing on throughput (and thus $/inference) - what do you see as your primary market?
Yes, because we're one of the only companies whose hardware can actually support low latency! Everyone else is stuck with traditional designs and they try to make up for their high latency by batching to get higher throughput. But not all applications work with high throughput/high latency ... Low latency unlocks feeding the result of one model into the input of another model. Check out this conversational AI demo on CNN. You can't do that kind of thing unless you have low latency.
https://www.youtube.com/watch?v=pRUddK6sxDg&t=235s
Might be a bit out of context, but isn't the TPU also optimized for low latency inference? (Judging by reading the original TPU architecture paper here - https://arxiv.org/abs/1704.04760). If so, does Groq actually provide hardware support for LLM inference?
Jonathan Ross on that paper is Groq's founder and CEO. Groq's LPU is a natural continuation of the breakthrough ideas he had when designing Google's TPU.
Could you clarify your question about hardware support? Currently we build out our hardware to support our cloud offering, and we sell systems to enterprise customers.
Thanks for the quick reply! About hardware support, I was wondering if the LPU has a hardware instruction to compute the attention matrix similar to the MatrixMultiply/Convolve instruction in the TPU ISA. (Maybe a hardware instruction which fuses a softmax on the matmul epilogue?)
We don't have a hardware instruction but we do have some patented technology around using a matrix engine to efficiently calculate other linear algebra operations such as convolution.
Are you considering targeting consumer market? There are a lot of people throwing $2k-$4k into local setups and they primarily care about inference.
At the moment we're concentrating on building out our API and serving the enterprise market.
If I understand correctly, you're using specialized hardware to improve token generation speed, which is heavily latency-bound on the speed of computation. However, generating tokens usually only requires matrix-vector multiplications. If I enter a prompt with ~100 tokens then your service goes much slower, probably because you have to multiply 2-dimensional matrices. What are you doing to improve the computation speed of prompt processing?
I don't think it should be quadratic in input length. Why do you think it is?
You can ask your website: "What is the computational complexity of self-attention with respect to input sequence length?"
It'll answer something along the lines of self-attention being O(n^2) (where n is the sequence length) because you have to compute an attention matrix of size n^2.
There are other attention mechanisms with better computational complexity, but they usually result in worse large language models. To answer jart: We'll have to wait until someone finds a good linear attention mechanism and then wait some more until someone trains a huge model with it (not Groq, they only do inference).
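A tiny illustration of that quadratic growth - the attention score matrix has one entry per pair of tokens, so doubling the prompt length quadruples its size:

    import numpy as np

    d = 64
    for n in (128, 256, 512):
        Q = np.zeros((n, d))
        K = np.zeros((n, d))
        scores = Q @ K.T                      # shape (n, n)
        print(n, scores.shape, scores.size)   # size grows as n^2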
Changing the way transformer models work is orthogonal to gaining good performance on Mistral. Groq did great work reducing the latency of generating tokens during inference considerably. But I wouldn't be surprised if they etched the A matrix weights in some kind of fast ROM, used expensive SRAM for the skinny B matrix, and sent everything else that didn't fit to good old fashioned hardware. That's great for generating text, but prompt processing is where the power is in AI. In order to process prompts fast, you need to multiply weights against 2-dimensional matrices. There is significant inequality in software implementations alone in terms of how quickly they're able to do this, irrespective of hardware. That's why things like BLAS libraries exist. So it'd be super interesting to hear how a company like Groq, which leverages both software and hardware specifically for inference, is tackling its most important aspect.
One GroqCard has 230 MB of SRAM, which is enough for every single weight matrix of Mixtral-8x7B. Code to check:
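(The original snippet didn't survive the copy here; below is a sketch of the kind of check presumably meant, assuming FP16 weights and Mixtral-8x7B's published dimensions.)

    d_model, d_ff = 4096, 14336            # Mixtral-8x7B hidden / per-expert FFN dimensions
    bytes_per_weight = 2                   # FP16
    largest_matrix = d_model * d_ff * bytes_per_weight
    print(largest_matrix / 2**20, "MiB")   # ~112 MiB, comfortably under 230 MB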
tome's other comment mentions that they use 568 GroqChips in total, which should be enough to fit even Llama2-70B completely in SRAM. I did not do any math for the KV cache, but it probably fits in there as well. Their hardware can do matrix-matrix multiplications, so there should not be any issues with BLAS. I don't see why they'd need other hardware.

OK, thanks, that's useful to know. Personally I'm not involved directly in implementing the model, so I don't know what we do there.
All I know is that when I run llama.cpp, a lot of the matrices that get multiplied have their shapes defined by how many tokens are in my prompt: https://justine.lol/tmp/shapes.png Notice how the B matrix is always skinny for generating tokens, but for batch processing of the initial prompt, it's fat. It's not very hard to multiply a skinny matrix, but once it's fat it gets harder. Handling the initial batch processing of the prompt appears to be what your service is slow at.
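Concretely, the difference being pointed at looks roughly like this (illustrative shapes only, not anyone's actual kernel code):

    import numpy as np

    # Prefill multiplies the weights by a "fat" activation matrix (one row per
    # prompt token); decode multiplies by a "skinny" one (a single row per step).
    d_model, d_ff, n_prompt = 4096, 14336, 100
    W = np.zeros((d_model, d_ff), dtype=np.float32)

    prefill_acts = np.zeros((n_prompt, d_model), dtype=np.float32)
    decode_acts = np.zeros((1, d_model), dtype=np.float32)

    print((prefill_acts @ W).shape)    # (100, 14336): work grows with prompt length
    print((decode_acts @ W).shape)     # (1, 14336): roughly constant work per new token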
Friendly fyi - I think this might just be a web interface bug, but I submitted a prompt with the Mixtral model and got a response (great!), then switched the dropdown to Llama, submitted the same prompt, and got the exact same response.
It may be caching or it didn't change the model being queried or something else.
Thanks, I think it's because the chat context is fed back to the model for the next generation even when you switch models. If you refresh the page that should erase the history and you should get results purely from the model you choose.
Appreciate the quick reply! That's interesting.
You're welcome. Thanks for reporting. It's pretty confusing so maybe we should change it :)
I've always liked how openrouter.ai does it
They allow you to configure chat participants (a model + params like context or temp) and then each AI answers each question independently in-line so you can compare and remix outputs.
openrouter dev here - would love to get Groq access and include it!
How does the Groq PCIe card work exactly? Does it use system RAM to stream the model data to the card? How many T/s could one expect with e.g. 3600 MHz DDR4 RAM?
We build out large systems where we stream in the model weights to the system once and then run multiple inferences on it. We don't really recommend streaming model weights repeatedly onto the chip because you'll lose the benefits of low latency.
How does that work when the card only has 230MB of SRAM?
We connect hundreds of chips across several racks with fast interconnect.
How fast is the memory bandwidth of that fast interconnect?
Have a look at section 2.3 of our paper. Between any two chips we get 100 Gbps. The overall bandwidth depends on the connection topology used. I don't know if we make that public.
https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...
Hi. Are these ASICs only for LLMs or could they accelerate other kinds of models(vision) as well?
It's a general purpose compute engine for numerical computing and linear algebra, so it can accelerate any ML workload. Previously we've accelerated models for stabilising fusion reactions and for COVID drug discovery:
* https://alcf.anl.gov/news/researchers-accelerate-fusion-rese...
* https://wow.groq.com/groq-accelerates-covid-drug-discovery-3...
So is this specific chip only for LLMs, as the name LPU (Language Processing Unit) suggests?
The chip is capable of running general numerical compute, but because we're focusing almost entirely on LLMs at the moment we've branded it the LPU.
Is it possible to buy Groq chips and how much do they cost?
https://www.mouser.com/ProductDetail/BittWare/RS-GQ-GC1-0109...
Only $20,625.00!
Per chip? So the current demo with 568 chips costs.... $11,715,000?!?!
@tome For the deterministic system, what if the timing for one chip/part is off due to manufacturing/environmental factors (e.g., temperature)? How does the system handle this?
We know the maximum possible clock drift and so we know when we need to do a resynchronisation to keep all the chips in sync. You can read about it in section 3.3 of our recent whitepaper: https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...
Those sorts of issues are part of timing analysis for a chip, but once a chip's clock rate is set, they don't really factor in unless there is some kind of dynamic voltage/frequency scaling scheme going on. This chip probably does not do any of that and just uses a fixed frequency, so timing is perfectly predictable.
You're running fp32 models, fp16 or quantized?
FP16 for calculating all activations. Some data is stored as FP8 at rest.
What is the underlying architecture of the ASICs. Does it use systolic arrays?
Yes, our matrix engine is quite similar to a systolic array. You can find more details about our architecture in our paper:
https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...
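As a mental model only (not Groq's actual matrix engine): a systolic-style matmul builds each output up from partial products accumulated wavefront by wavefront, which you can mimic as a sum of rank-1 updates:

    import numpy as np

    def systolic_style_matmul(A, B):
        # One "wavefront" per element of the shared dimension; each step adds a
        # rank-1 partial-product update, the way MAC cells accumulate results.
        n, k = A.shape
        k2, m = B.shape
        assert k == k2
        C = np.zeros((n, m))
        for step in range(k):
            C += np.outer(A[:, step], B[step, :])
        return C

    A = np.arange(6, dtype=float).reshape(2, 3)
    B = np.arange(12, dtype=float).reshape(3, 4)
    assert np.allclose(systolic_style_matmul(A, B), A @ B)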
Alright, I'll bite. Haskell seems pretty unique in the ML space! Any unique benefits to this decision, and would you recommend it for others? What areas of your project do/don't use Haskell?
Haskell is a great language for writing compilers! The end of our compilation pipeline is written in Haskell. Other stages are written in C++ (MLIR) and Python. I'd recommend anyone to look at Haskell if they have a compiler-shaped problem, for sure.
We also use Haskell on our infra team. Most of our CI infra is written in Haskell and Nix. Some of the chip itself was designed in Haskell (or maybe Bluespec, a Haskell-like language for chip design, I'm not sure).
Are your accelerator chips designed in-house? Or are they some specialized silicon or FPGPU or something that you wrote very optimized inference code for?
It's really amazing! The first time I tried the demo, I had to try a few prompts to believe it wasn't just an animation :)
Yup, custom ASIC, designed in-house, built into a system of several racks, hundreds of chips, with fast interconnect. Really glad you enjoyed it!
It seems like you are making general-purpose chips to run many models. Are we at a stage where we can consider taping out inference networks directly, propagating the weights as constants in the RTL design?
Are chips and models obsoleted on roughly the same timelines?
I think the models change far too quickly for that to be viable. A chip has to last several years. Currently we're seeing groundbreaking models released every few months.
Impressive speed. Are there any plans to run fine-tuned models?
Yes, we're working on a feature to give our partners the ability to deploy their own fine-tuned models.
How well would your hardware work for image/video generation?
It should work great as far as I know. We've implemented some diffusion models for image generation but we don't offer them at the moment. I'm not aware of us having implemented any video models.
As it works at inference, do you think 'Representation Engineering' could be applied to give a sort of fine-tuning ability? https://news.ycombinator.com/item?id=39414532
FYI, I only see a repeating animation and nothing else on my iPhone in lockdown mode, with Safari or Firefox.
How do I sign up for API access? What payment methods do you support?