Consistency LLM: converting LLMs to parallel decoders accelerates inference 3.5x

doctor_eval
10 replies
20h28m

Our research shows this process – mimicking human cognitive process of forming complete sentences in mind before articulating word by word

This is not how I work. Is there something wrong with me?

jerbear4328
1 replies
20h19m

Nor is it how I work, and I think that's normal enough. I do have an idea of what I'm going to say before I say it; I think that's closer to what they meant. I think and speak in increments of ideas, not words.

Filligree
1 replies
20h15m

You might not have an internal monologue. A lot of us don't, and the ones that do are equally shocked every time they find out. For what it's worth, I'm in the same boat—can form sentences, but why would I? It'd slow me down.

People who don't have inner monologues tend to assume that all that stuff is some form of analogy or metaphor. It's not. It's entirely literal.

oceanplexian
0 replies
20h7m

Do you mean in a real time conversation?

Because I definitely don't "have an internal monologue about what I'm going to say" in the 100ms between when someone asks a casual question and I respond to it.

throwawaymaths
0 replies
17h53m

Are you sure? It might not be the whole sentence, but I would find it hard to believe that in practice the way you speak or write is like

hello <think> May <think> be <think> I'll <think> go <think> get <think> break <think> fast

snyhlxde
0 replies
18h11m

In some conversations, maybe it's easier to form complete sentences. In others, the best we can do is have a rough draft of what to say in mind and then refine it word by word while speaking.

mdp2021
0 replies
20h8m

"Rem tene, verba sequentur" (you hold the matter, then words come) is largely "how it works".

You form logical ideas as you speak; as you speak, your speech develops, so the translation is from ideas to sentences. It is not clear in which phase one would mentally form a complete sentence, nor why it should be relevant. You "see something [that makes sense]", then you describe it - iteratively.

giardini
0 replies
19h46m

Probably.

causal
0 replies
2h41m

You are probably pretty far from the LLM extreme, though, of thinking one token at a time.

DrSiemer
0 replies
20h13m

They probably do not mean people form entire sentences before expressing them, I am not aware of anybody doing that. I assume it refers to people first coming up with a global outline of what they want to say before they start speaking.

DoctorOetker
10 replies
19h50m

This mirrors what I experienced when I enrolled in "free drawing" (no teaching) classes:

While people have considered me a good drawer since I was a child, I remember either just repeating detailed drawings similar to ones I had drawn before, or otherwise taking plenty of time to draw. I believe anyone with time and patience can make a nice drawing of a scene.

The "free drawing" class had no rules or lectures: you brought the materials you wanted to work with (some brought ink, others pencils, while I brought charcoal). The only thing determined was the timing between poses for the model: for each session the first few poses were very short (say a minute), and then the pose durations would progressively lengthen until say 5 minute poses. At all times you were free to tear your picture up and retry drawing the pose again.

My drawing skills improved considerably. The short "warmups" actually force you to get proportions and outlines correct on the first tries. Conventional wisdom says haste makes waste, but when learning or refining skills, it seems natural selection has hardcoded the sensation of haste as a stressor prompting attention and learning.

I am convinced I could have drawn drawings of similar quality before enrolling in those classes, except they would have easily taken me 5 or 10x as long. Being forced not to beat around the bush, and feeling the penalty of making a hasty mistake (which further cuts into the time left for a second try), does seem to work.

My only gripe is that the technique is termed "Consistency", whereas I would reserve such a term for an improvement in performance, not inference speed, although I understand that they mean "consistency with what would ultimately have been generated one token at a time". I would rather dub it "Proficiency LLM", where the same output is expected, only without the inhibition of stuttering to the same conclusion.

snyhlxde
5 replies
18h56m

Hi, we are the CLLM authors, and thanks for sharing your experience and insights! I can see how this drawing-skill refining process echoes the training process in CLLM; the only difference is that, at this point, the stressor in CLLM training does not get progressively more demanding.

For example, while drawing, you can set a very specific time limit on how long you are allowed to draw in each trial and make the time progressively shorter. In CLLM, maybe we can make the learning process more and more difficult by mapping more and more distant states in the Jacobi trajectory to its final state.

We use the term "consistency" because we draw a parallel between consistency LLMs and the consistency models in diffusion image generation, where the training processes are analogous.

Quarrel
2 replies
18h40m

Is it just me, or does this read like it was written by an LLM ... ?!

snyhlxde
0 replies
18h30m

lol I take that as a compliment. Good try but sadly no LLM in this writing :)

jasonjmcghee
0 replies
2h30m

It's just much more formal than people generally speak on HN.

boroboro4
1 replies
15h36m

Do you use the same dataset to train and evaluate the model? Was the model in the example trained on the GSM8K dataset, for instance?

snyhlxde
0 replies
15h32m

Yes, we consider both domain-specific applications (Spider for text2SQL, GSM8K for math, CodeSearchNet for Python) as well as open-domain conversational applications (ShareGPT). We use the test set from each application to evaluate CLLMs' performance in our paper.

On the other hand, technically CLLMs work on any kind of query, but the speedup might vary. Feel free to try out our codebase for your use cases!

manmal
1 replies
19h37m

Systems generally become more efficient when under stress. They are also forced into local optima - everything has upsides and downsides.

sheepscreek
0 replies
17h19m

Interestingly - this is the idea behind Nassim Taleb’s book “Antifragile” and the concept of “anti-fragility”.

In essence, it promotes dynamic/evolutionary/always-learning behaviour rather than performing the same set of steps every time, and in the process becoming stronger than before.

An example he shares is how the breakdown of muscle tissue through exercise leads to more muscle development and an increase in strength. I guess it's similar to LLM training using error/loss-reducing functions (practice makes perfect), but dissimilar in the sense that training is a one-time action.

aamargulies
1 replies
18h56m

I had an interesting experience in an Invertebrate Zoology lab class one summer.

We students were brought into a lab, given specimens to draw, and the only instructions we received were 'You have 30 minutes to draw this. Go.'

There was no "here's how to draw. here's what to do and not to do". It was just basically "We don't care about any insecurities you might have. We don't care if you think you can't draw. No excuses, just fucking draw it. Now."

Not only did we draw, but we (all of us) improved enormously over the course of the class as more animals were brought in and the exercise was repeated over and over and over again throughout the summer.

What it taught us is that everyone, and I mean everyone, can draw. Our collective attitude shifted from "don't know if this is even possible" to "of course we can do this. this is easy. routine. trivial."

Highly recommended approach.

It was the most freeing and amazing class I had in college.

Version467
0 replies
13h33m

That sounds like a pretty awesome experience. Thanks for sharing.

wangii
7 replies
12h39m

I feel it's a pretty dangerous optimization before we REALLY understand what's going on inside the LLM. E.g. the folks who believe in the geometric interpretation will have something to say, and it would probably hurt if you are using "filler" tokens.

Besides, the assumption (not a universal fact) of "forming complete sentences in mind before articulating word by word" seems to oversimplify what happens in our minds: do we really have complete planning before we start talking/typing? As a Buddhist, I lean towards it being an illusion. Furthermore, what about simultaneous thoughts? Are we linear thinkers at the sentence level?

anyway, pretty neat math!

renonce
4 replies
11h56m

The optimization does not affect the result of the LLM; it's guaranteed to produce results equivalent to decoding directly. Let's not treat the LLM as some magic that resembles our mind; it's just another program that produces sentences that happen to make sense.

sigmoid10
1 replies
11h40m

Let's not treat our mind as something magical. It's just another program that learned to speak by consuming lots of training input. The implementation might look slightly different from the outside, but from a mathematical perspective, artificial neural networks are proven to be at least as capable as the human mind.

baq
0 replies
11h36m

The best part is, your comment works both when sarcastic and completely serious.

wangii
0 replies
9h55m

According to the original Jacobi decoding paper, it's set in machine translation tasks, with an encoder + decoder, where the parallel algorithm is applied only to the decoder part.

naasking
0 replies
3h21m

Let's not treat the LLM as some magic that resembles our mind; it's just another program that produces sentences that happen to make sense.

"That happen to make sense" is hiding a lot of magic. It would be statistically impossible to make as much sense as LLMs do in response to prompts if it did not actually make semantic distinctions. If it makes semantic distinctions, then it does resemble the human mind in at least one way.

causal
0 replies
2h41m

What is the geometric interpretation?

Etheryte
0 replies
12h26m

That assumption might be useful in this context, but I think it's pretty clearly not true. Ask anyone to tell you about a complex past event with a lot of parallel branches and you'll quickly see them add bits, pieces and tangents midsentence to cover the full range of events. I don't think I've seen the sentence granularity hypothesis in any serious scientific context before.

JKCalhoun
6 replies
18h45m

Anyone know somewhere someone dumb like me can "Ask an AI expert"?

I want to ask, for example, how is it that an LLM when given the same prompt does not respond in the same deterministic way?

I guess I want to learn this stuff and should maybe follow one of those "write an LLM in an hour" type videos on YouTube.

zipfcharge
1 replies
18h35m

It's because an LLM is essentially a probability matrix. You type a prompt, then it calculates the probability of each possible next word, and so on, eventually forming a sentence. The probabilities it has learned are based on the training data.

Because of the underlying probability model, it's not going to be 100% deterministic. Plus, a model like ChatGPT purposefully has a "temperature" parameter that adds further randomisation to the whole process.

My answer is based on this paper if you're interested to read more: The Matrix: A Bayesian learning model for LLMs, https://arxiv.org/abs/2402.03175
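
In toy numpy terms, "probability of the next word" plus temperature looks roughly like this (made-up numbers, not any particular model's implementation):

    import numpy as np

    def sample_next_token(logits, temperature=1.0, rng=None):
        # Turn raw next-token scores into a probability distribution
        # (softmax of temperature-scaled logits) and draw one token from it.
        rng = rng or np.random.default_rng()
        z = logits / temperature
        z = z - z.max()                  # subtract the max for numerical stability
        probs = np.exp(z)
        probs = probs / probs.sum()
        return int(rng.choice(len(probs), p=probs))

    logits = np.array([2.0, 1.0, 0.1])   # made-up scores for three candidate tokens
    print(sample_next_token(logits, temperature=0.7))  # usually 0, sometimes 1 or 2

Lower temperature sharpens the distribution toward the top-scoring token; higher temperature flattens it, which is where the run-to-run variation comes from.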

flopriore
0 replies
13h20m

Are there any ways to show the source of the information retrieved by the model? For instance, the LLM forms a sentence and it points to a stackoverflow answer with the same or similar content.

zozbot234
0 replies
18h40m

I want to ask, for example, how is it that an LLM when given the same prompt does not respond in the same deterministic way?

You can control that in most systems with an inference-time parameter called "temperature". But setting the temperature as low as possible tends to lead to very low-quality answers - the system can't crawl out of some local optimum and ends up repeating itself over and over. Such answers may be "deterministic", but they're also not good.

throwawaymaths
0 replies
17h50m

how is it that an LLM when given the same prompt does not respond in the same deterministic way?

In the software (not in the model), there's literally a random number generator that picks from a weighted set of "next-token" choices that the model spits out. The selection process has a series of knobs to manipulate the responses. If you want it to be deterministic (and you have direct access to the software), you can set "top-k = 1" or "temperature = 0.0" (depending on your software) and it will be deterministic.

Usually the default settings are not for determinism, because for whatever reason the quality of the results tends to not be that good when you go fully deterministic.
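
A toy version of those knobs (a hypothetical sampler, not any specific library's API):

    import numpy as np

    def pick_token(logits, temperature=1.0, top_k=None, rng=None):
        # Toy sampler: top-k filtering plus temperature, like the knobs above.
        rng = rng or np.random.default_rng()
        if top_k is not None:
            cutoff = np.sort(logits)[-top_k]
            logits = np.where(logits >= cutoff, logits, -np.inf)  # keep only the top k scores
        if temperature == 0.0:
            return int(np.argmax(logits))       # temperature 0: always the single best token
        z = (logits - logits.max()) / temperature
        probs = np.exp(z)                       # filtered-out tokens get probability 0
        return int(rng.choice(len(logits), p=probs / probs.sum()))

    logits = np.array([2.0, 1.0, 0.1])
    print(pick_token(logits, top_k=1))          # deterministic
    print(pick_token(logits, temperature=0.8))  # can vary from run to run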

rahimnathwani
0 replies
18h38m

For this particular question, ask chatgpt how temperature affects llm softmax sampling.

For other things, study using Karpathy's videos.

8note
0 replies
18h41m

For that answer, you can refer to the 3blue1brown videos.

The LLM outputs a vector of probabilities over tokens, and the inference code picks a token from the most-likely list using a random number.

dvt
5 replies
20h7m

There's no free lunch™, so from what I can tell there's some pathway loss here. E.g. some Jacobi trajectories definitionally exclude higher temperature paths. Which might actually be a positive given data retrieval (but a negative if we want to maximize for creativity?).

wrsh07
4 replies
19h2m

There are better and worse algorithms. I'm not sure "there is no free lunch" always applies in a particularly meaningful way. Some things aren't on the pareto frontier.

factormeta
3 replies
18h29m

Kinda like the AIFF -> MP3 conversion process. A lot of data is lost, but can we humans really tell much of a difference?

wrsh07
2 replies
17h22m

There's no reason to think the current next token prediction models are optimal for predicting sentences (they aren't!)

An algorithm may outperform another on a problem when neither is specialized to the problem

https://en.m.wikipedia.org/wiki/No_free_lunch_in_search_and_...

stkdump
1 replies
3h29m

I would go even further and say there isn't any indication that we are even close to what is possible. My subjective feeling is that with the current rate of progress it is entirely possible that we will have GPT-4 level performance locally on smartphone hardware within 3-10 years (unless companies decide again that they don't want to give this kind of power away)

naasking
0 replies
1h22m

Probably. Advancements in ML algorithms, like this one, have been outpacing advancements in hardware for a while now, so both are converging on making ML faster and more ubiquitous.

paulclark
4 replies
20h47m

Is this how Groq (https://groq.com/) is so fast, or are they doing something different?

buildbot
1 replies
20h36m

Groq is serving an LLM from (hundreds of chips' worth of) SRAM, so the effective bandwidth, and thus token generation speed, is an order of magnitude higher than HBM. This would 3.5x their speed as well; it is orthogonal.

gdiamos
0 replies
10h14m

I'm surprised no one has done this for a GPU cluster yet - we used to do this for RNNs on GPUs & FPGAs at Baidu:

https://proceedings.mlr.press/v48/diamos16.pdf

Or better yet - on Cerebras

Kudos to groq for writing that kernel

wrsh07
0 replies
18h59m

My understanding is that theirs is a pure hardware solution. The hardware is flexible enough to model any current NN architecture.

(Incidentally, there are black-box optimization algorithms, so a system as good as Groq at inference might be useful for training even if it can't support gradient descent)

throwawaymaths
0 replies
17h57m

According to someone I talked to at a Groq event I was invited to (I did not sign an NDA), they are putting ~8 racks of hardware per LLM. Of course, coordinating those racks to have exact timings between them to pull tokens through is definitely "part of the hard part".

nico
4 replies
19h58m

Interesting

I think soon we are going to realize that we don't really need to train the models

We just need good indexing and sampling

Essentially at some level any LLM is equivalent to a DB of the dataset, with a great NLP interface on top

Both are just different methods of navigating stored data

tempusalaria
1 replies
18h11m

LLMs can easily produce data not in training dataset.

LLMs do not navigate stored data. An LLM is not a DB of the training data.

carlthome
0 replies
8h57m

I've had the same thought as above but unfounded (just a feeling, pretty much) so I'm curious to learn more. Do you have any references I can check out that supports these claims?

sdrg822
0 replies
19h25m

But indexing *is* training. It's just not using end-to-end gradient descent.

alfalfasprout
4 replies
21h23m

Wow, I'm mindblown this isn't getting more attention. This seems like a clear win for inference. Fine tuning cost for this is reasonable (around 0.01% of the original pre-training cost). And the performance wins seem fairly consistent.

lopuhin
1 replies
19h46m

Similar or greater inference wins are achieved with speculative decoding, which is already widely used, so while this is really interesting (and was tried before with less success, AFAIK), it's not yet clear how impactful it will be.

WhitneyLand
0 replies
2h54m

I don’t see where similar wins have ever been achieved.

Speculative decoding can reduce latency, but at the cost of using a lot more compute. The amazing thing here is latency and global throughput improvements would be realized because of the increase in efficiency.

From what I understand speculative decoding can also come with more challenges insofar as trying to maintain overall output quality.

snyhlxde
0 replies
18h38m

Thanks for your interest in our work! Yes, we found that training with the consistency loss + AR loss on even a subset of a dataset results in a significant speedup (0.01% of pre-training cost). Training on more data permits even further speedup: the model is able to learn from more frequently-appearing collocations and phrases.

For more details, please check out our paper; you can also see how the speedup saturates as the size of the training data grows.
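
For intuition, a "consistency loss + AR loss" of this flavor might look like the following simplified PyTorch-style sketch (hypothetical shapes and names, not the paper's exact objective or code; model(ids) is assumed to return logits of shape [batch, seq_len, vocab]):

    import torch
    import torch.nn.functional as F

    def cllm_style_loss(model, prompt_ids, jacobi_state, fixed_point, alpha=1.0):
        # Consistency term: feed the prompt plus an intermediate (partially wrong)
        # n-token state from a Jacobi trajectory, and push the model's predictions
        # for that block toward the trajectory's fixed point.
        p = prompt_ids.shape[1]
        inp = torch.cat([prompt_ids, jacobi_state], dim=1)
        block_logits = model(inp)[:, p - 1:-1, :]        # positions that predict the block
        consistency = F.cross_entropy(
            block_logits.reshape(-1, block_logits.size(-1)),
            fixed_point.reshape(-1))

        # AR term: ordinary next-token loss on the converged sequence, so normal
        # autoregressive quality is preserved.
        full = torch.cat([prompt_ids, fixed_point], dim=1)
        ar_logits = model(full)[:, :-1, :]
        ar = F.cross_entropy(
            ar_logits.reshape(-1, ar_logits.size(-1)),
            full[:, 1:].reshape(-1))

        return consistency + alpha * ar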

WhitneyLand
0 replies
2h27m

Yes, seems like a huge important result for LLM performance.

I'm not aware of any other paper that has offered to increase LLM inference performance to this degree. Has there ever been one before?

At least while also:

- Maintaining output quality. The benchmarks used were somewhat narrow but so far so good.

- Improving not just query latency but also global throughput

- Not requiring more compute

- Having a relatively practical implementation and not adding big challenges and complexity

You could argue the insight is incremental, as it builds on what's been done with parallel/Jacobi decoding. Those previous results were necessary and important, but this may be the one that finally extracts real-world value from the promise of parallel decoding.

renonce
3 replies
13h19m

... speculative decoding methods ... incurs extra memory cost during inference time.

Any detail on this? For speculative decoding you need a smaller model to generate "branches", which are fast but maybe inaccurate, and then verify these branches later with a larger model. However, only memory equivalent to a single token is needed for speculative decoding, and tokens in other branches are simply masked out during inference. With a context size of 1000 and ~30 branches of 5 tokens, the memory overhead would be 3%, which is negligible. And if your context size is much smaller compared to the number of branches - well, would someone who uses a generative LLM with a context window of just 50 tokens care about generation speed?

Also, speculative decoding techniques are not restricted to greedy sampling - they're expected to behave exactly the same as the original model and sample with the expected probabilities. Most literature on speculative decoding already reports a 2.6x-3.5x speedup. The blog post here reports 2.4x-3.4x generation speed - which isn't that much of an upgrade?

While I mentioned speculative decoding above, and Medusa2 and Eagle seem to be the techniques that the authors compare against, the core problem remains: whatever method you use to predict tokens ahead of time, there is a specific point where the previous tokens are absolutely needed before predicting the next token. It doesn't depend on what your model is or what your techniques are; it's just about what is mathematically achievable. How can you predict 5 tokens at once if the probability distribution of the 5th next token depends heavily on the previous 4? Speculative decoding, Jacobi decoding, multi-token parallel decoding, whatever.

If only greedy sampling is supported, then I wonder what the advantages of this method are, not to mention that other techniques already achieve the expected speedup. Comparing greedy-sampling speedups to random-sampling speedups is comparing apples to oranges, and I doubt the speedup described for this method would remain after it is adapted to random sampling (due to the core problem mentioned above).
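
For context, greedy speculative decoding boils down to something like this sketch (hypothetical draft/target logit functions, not any specific library's API; real implementations also handle non-greedy sampling with a rejection step):

    import numpy as np

    def speculative_step(target_logits_fn, draft_logits_fn, ids, k=5):
        # One greedy speculative-decoding step: a small draft model proposes k
        # tokens, the large target model verifies them in a single forward pass.
        # Both *_logits_fn(ids) are assumed to return next-token logits for
        # every position, shape [len(ids), vocab_size].
        draft = list(ids)
        for _ in range(k):
            draft.append(int(np.argmax(draft_logits_fn(draft)[-1])))  # cheap greedy proposals

        target_next = np.argmax(target_logits_fn(draft), axis=-1)     # one big forward pass

        out = list(ids)
        for i in range(k):
            pos = len(ids) + i - 1        # target position whose prediction is block token i
            if target_next[pos] == draft[len(ids) + i]:
                out.append(draft[len(ids) + i])    # target agrees: accept the drafted token
            else:
                out.append(int(target_next[pos]))  # first disagreement: take the target's token
                break
        else:
            out.append(int(target_next[-1]))       # all accepted: one extra token for free
        return out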

Palmik
1 replies
12h45m

Speculative decoding requires you to load the smaller model into memory and run inference on it.

renonce
0 replies
12h24m

I think the smaller model is at least 20 times smaller. If you do speculative decoding on a 70B model, a 1B model would be appropriate.

cxczz
0 replies
5h11m

"the previous tokens are absolutely needed before predicting the next token"

Maybe this is the key contribution of this paper: demonstrating that, through consistency training, LLMs can predict the next n tokens even if there are incorrect guesses among the previous tokens?

On the other hand, while mathematically it is true that p(x_t | x_1, ..., x_{t-1}) depends on all of x_1 to x_{t-1}, in practice it is possible that predicting x_t only requires x_1 to x_{t-2}, and the attention to x_{t-1} is minimal. Thus, predicting x_t from x_1 to x_{t-2} and an inaccurate x_{t-1} is possible.

ec109685
3 replies
19h46m

Could someone please explain the intuition behind this technique in more layman's terms?

TomatoCo
1 replies
19h25m

For all of these "how can we batch predicting the next n tokens?" the intuition is basically that it takes a buttload of math to predict some of the tokens, but that most tokens are actually easy to guess. For example, if I asked "What was that phone number from that 80's song?" as soon as a model generates 867- it shouldn't take that much math at all to finish predicting 5309.

snyhlxde
0 replies
18h20m

A bit more intuition on how training works: in natural language processing, some phrases/collocations, for example "remind ... of ...", "make a decision", "learn a skill" etc., are used together. We can ask LLMs to learn such collocations & frequently-appearing n-grams. After learning, the model can use parallel decoding to predict, in one forward pass, many tokens that frequently appear together.

programjames
0 replies
6h11m

"Try to fix all the words in a sentence at once. Keep iterating until you don't think it needs fixing."

rcarmo
2 replies
20h28m

Can't wait to see something like this merged into ollama (I'm sure there would be plenty of people fine-tuning models for it).

helloericsf
0 replies
19h47m

The lab is tied to the vLLM project. I would say it might get picked up sooner by vLLM than other inference frameworks.

Me1000
0 replies
19h58m

Ollama doesn't have their own inference engine, they just wrap llama.cpp. But yes, it will be awesome when it's more generally available.

miven
2 replies
20h34m

The authors mention that Jacobi decoding is equivalent to greedy autoregressive decoding, but in practice don't we often want the sampling temperature to be above zero to avoid repetitions and excessively generic responses?

I'm completely unfamiliar with this decoding strategy so maybe I'm just missing a simple way to account for that.

snyhlxde
0 replies
18h48m

Yes, this is a great question! We are actively working on supporting sampling strategies other than greedy sampling. In the context of CLLM training, instead of mapping to a static fixed point obtained from Jacobi decoding as the training objective, the target becomes what we term a dynamic fixed point. You can keep an eye on our GitHub repo for new progress.

matheist
0 replies
19h58m

Agreed. It's straightforward to check that a token was the argmax, but it seems difficult to check that a token appeared with the probability you wanted it to. You could still do the fine-tuning step I guess, where you train the trajectories to approach n-token completions with the statistics you want, but I can't see how you can replace the "check for a fixed point" step. Maybe "check the result was above this fixed threshold for likelihood".

toxik
1 replies
21h41m

Interesting stuff. I guess the idea has occurred to many, but it was well written and presented.

programjames
0 replies
6h9m

Yep. My roommate and I were talking about this a year ago. You can also do something similar for LLM steering.

fermuch
1 replies
21h38m

Would something like this apply to MAMBA/JAMBA too?

wrsh07
0 replies
18h54m

I think any next-token predictor will benefit. IIUC, Mamba is a next-token predictor.

I just skimmed the gradient article, but if their only change is swapping out the transformer block for the Mamba block, I don't think it's already using this optimization.

andy12_
1 replies
21h40m

At first I thought that this was another Medusa-like paper, simply using more unembedding heads for guessing subsequent tokens, but damn, not at all. This is amazing. And it doesn't even use extra parameters; it's just an auxiliary training loss.

snyhlxde
0 replies
17h45m

The only similarity between Medusa and CLLM is that both train and adapt LLMs for fast inference. But they use completely different training and decoding techniques, and as you pointed out, CLLMs don't need extra parameters or an attention mask configured for tree-based verification.

snyhlxde
0 replies
18h32m

From the CLLM authors:

Thank you all for the great questions and insights! We have made a Twitter post with some more details, and we invite you to engage with us on Twitter as well.

https://twitter.com/haoailab/status/1788269848788869299

programjames
0 replies
6h14m

Surprisingly, we find such an objective is analogous to that of consistency models

This is why numerical methods should be part of the ML curriculum.

m3kw9
0 replies
19h5m

They could quickly try it with one of the open-source models, then show a side-by-side demo.