Jamba boasts an extensive context window of 256K tokens, equivalent to around 210 pages of text, while fitting up to 140K tokens on a single 80GB GPU.
I realize this is a big improvement, but it's striking how inefficient LLMs are when you need 80 GB of GPU memory to analyze less than a megabyte of data. That's a lot of bloat! Hopefully there's a lot of room for algorithmic improvements.
It’s literally simulating a neural network.
How much of your 5-sense experiential memories and decades of academic book learning are you bringing to understand my reply to your post?
How many gigabytes do you think that’s equivalent to?
It's kinda simulating our brains, but not really. When I tried to dig into how neurons actually work, I realised there's a massive chasm of difference. Very much worth doing if you haven't (you might know far better than me; this is for people who don't yet).
In terms of results: our brains work with 20 W of power and can be trained to compete with LLMs using a tiny fraction of the world's data. They also have to keep you breathing and your blood pumping and manage all the dangers of catching a ball near traffic. Or skiing, or poetry, or sunsets. And they remember stuff five minutes later and don't need a training run that takes months.
We have SO many opportunities to improve the AI architecture it’s ridiculous. This is a good thing.
To be fair most of the brain is more like a pretrained model — it isn't being trained at any point after conception to keep your blood pumping or your lungs working, it does that out of the box roughly as soon as you sprout those organs (or the minute you're born, in the case of lungs). The training process was billions of years of evolution. And, well, given fairly persistent cross-cultural cognitive biases, I expect the conscious thought parts are starting from a pretrained model, too, and all we're doing in school is finetuning ;)
People don't understand that to simulate a single neuron, you need an entire neural network. So 70 billion parameters might at best be equivalent to a million neurons but that is assuming that your neural network architecture is akin to the connections between neurons. Considering the physical sparsity, you might need even more parameters to model the connections of a biological neural network. So less than a million neurons in practice.
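A back-of-the-envelope version of that arithmetic (the ~70K-parameters-per-neuron figure below is an illustrative assumption, not a measured value):

```python
# Rough illustration of the parent's arithmetic, not a measured result.
# Assumption: emulating one biological neuron's input/output behaviour
# takes a small ANN on the order of tens of thousands of parameters.
params_total = 70e9          # LLM parameter count from the comment
params_per_neuron = 70_000   # assumed cost of simulating one neuron

equivalent_neurons = params_total / params_per_neuron
print(f"~{equivalent_neurons:,.0f} neurons")  # ~1,000,000
```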
Jamba seems to be distributed as 21 files of roughly 5 GB each [1], so I guess that's another way of looking at it.
[1] https://huggingface.co/ai21labs/Jamba-v0.1/tree/main
So what? I have seen models distributed as 26x 10GB files.
I love both parent post perspectives on this.
The big (huge?) memory requirement is during training. These LLMs work with high-dimensional vectors, they calculate gradients with respect to those vectors, and they do updates that require the state of the optimizer. If you have 3 particles in 3 dimensions and you need their forces, that creates 3 new 3D vectors, and once you update their positions along those forces they also carry momenta. Now generalize this simple 3-body physics to the typical 60-layer creatures inside the LLM, with vectors of several thousand dimensions, interactions/weights that scale like the squares of those vectors, a total parameter count that adds up to tens to hundreds of billions, and then take derivatives and start keeping track of momenta. It is a feat of modern engineering that some groups can train such models efficiently. I hope we will see more of the training stories becoming public in the near future.
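A concrete sketch of where that memory goes. The dtype split below (bf16 weights and gradients, fp32 Adam moments) is common practice but an assumption here, and activations are ignored entirely; the ~52B parameter count is just what the ~105 GB of bf16 shards in the sibling comment implies:

```python
# Rough training-memory estimate for a dense model, ignoring activations,
# which add a large batch- and sequence-length-dependent chunk on top.
def training_memory_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """bf16 weights + bf16 grads + Adam's two fp32 moment tensors.
    An fp32 master copy of the weights often adds another 4 bytes/param."""
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9

def inference_memory_gb(n_params, weight_bytes=2):
    """Weights only; the KV cache comes on top of this."""
    return n_params * weight_bytes / 1e9

n = 52e9  # roughly what 21 x ~5 GB bf16 shards imply (~105 GB / 2 bytes per param)
print(f"training : ~{training_memory_gb(n):.0f} GB before activations")
print(f"inference: ~{inference_memory_gb(n):.0f} GB before the KV cache")
```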
This is wrong. You need big memory during inference too.
The difference there is you can use tricks like quantisation and offloading to CPU to reduce it somewhat at the cost of accuracy and/or speed.
Not sure what you mean by wrong. I have never yet encountered a case where training an LLM (no matter what the architecture) required only modest memory; I was pointing out that the typical memory requirements for training are much higher still than the typical requirements for inference.
Training takes roughly 3x the memory used by inference, and is usually run at a much larger batch size.
Two things I'm curious to know:
1. How many tokens can 'traditional' models (e.g. Mistral's 8x7B) fit on a single 80GB GPU?

2. How does quantization affect the single transformer layer in the stack? What are the performance/accuracy trade-offs that happen when so little of the stack depends on this bottleneck?
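For question 1, a crude way to estimate it. The attention config is the published Mistral-7B/Mixtral one (32 layers, 8 KV heads, head dim 128); the 4-bit weight footprint and fp16 cache are assumptions, and framework overhead and activations are ignored, so treat the result as an upper bound rather than a direct comparison with Jamba's 140K figure:

```python
# Back-of-the-envelope: how many tokens of fp16 KV cache fit next to the
# quantized weights on an 80 GB card.
GPU_GB = 80
n_layers, n_kv_heads, head_dim = 32, 8, 128        # Mistral/Mixtral attention config
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2   # K and V, fp16 -> 128 KiB

weights_gb = 46.7e9 * 0.5 / 1e9    # Mixtral 8x7B at ~4 bits/weight, ~23 GB (assumed)
free_bytes = (GPU_GB - weights_gb) * 1e9
print(f"~{free_bytes / kv_bytes_per_token / 1e3:.0f}K tokens of KV cache")
```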
Mixtral 8x7b runs well (i.e., produces the correct output faster than I can read it) on a modern AMD or Intel laptop without any use of a GPU - provided that you have enough RAM and CPU cores. 32 GB of RAM and 16 hyperthreads are enough with 4-bit quantization if you don't ask too much in terms of context.
P.S. Dell Inspiron 7415 upgraded to 64 GB of RAM here.
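Rough arithmetic for why 32 GB is enough at 4-bit. The ~46.7B total parameter count is the commonly cited figure for Mixtral 8x7B; the ~4.5 bits/weight is an assumption to account for the scales and zero-points that block-wise quantization formats store:

```python
# Approximate RAM needed to hold Mixtral 8x7B quantized to ~4.5 bits/weight.
n_params = 46.7e9
bits_per_weight = 4.5
model_gb = n_params * bits_per_weight / 8 / 1e9
print(f"~{model_gb:.1f} GB for weights")  # ~26 GB, leaving some headroom for KV cache in 32 GB
```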
That 80 GB is compressed human knowledge being applied to that 1 MB.
That's all the world's knowledge compressed into 80 GB. It's not analysing 1 MB of data, it's analysing all of that knowledge plus an additional 1 MB.
Compared to the human brain they are shockingly efficient. It's the hardware that isn't, but that is just a matter of time.