
Jamba: Production-grade Mamba-based AI model

skybrian
16 replies
2d

Jamba boasts an extensive context window of 256K tokens, equivalent to around 210 pages of text, while fitting up to 140K tokens on a single 80GB GPU.

I realize this is a big improvement, but it's striking how inefficient LLMs are: you need 80GB of GPU memory to analyze less than 1 megabyte of data. That's a lot of bloat! Hopefully there's a lot of room for algorithmic improvements.

electric_mayhem
6 replies
2d

It’s literally simulating a neural network.

How much of your 5-sense experiential memories and decades of academic book learning are you bringing to understand my reply to your post?

How many gigabytes do you think that’s equivalent to?

richardw
2 replies
1d19h

It's kinda simulating our brains, but not really. When I attempted to dig more into how neurons work, I realised there's a massive chasm of difference. Very much worth doing if you haven't (you might know far better than me; this is for people who don't yet).

In terms of results: our brains work on 20 watts of power and can be trained to compete with LLMs using a tiny fraction of the world's data. They also have to keep you breathing and your blood pumping, and manage all the dangers of catching a ball near traffic. Or skiing, or poetry, or sunsets. And they remember stuff five minutes later and don't need a training run that takes months.

We have SO many opportunities to improve the AI architecture it’s ridiculous. This is a good thing.

reissbaker
0 replies
1d18h

To be fair most of the brain is more like a pretrained model — it isn't being trained at any point after conception to keep your blood pumping or your lungs working, it does that out of the box roughly as soon as you sprout those organs (or the minute you're born, in the case of lungs). The training process was billions of years of evolution. And, well, given fairly persistent cross-cultural cognitive biases, I expect the conscious thought parts are starting from a pretrained model, too, and all we're doing in school is finetuning ;)

imtringued
0 replies
1d10h

People don't understand that to simulate a single neuron, you need an entire neural network. So 70 billion parameters might at best be equivalent to a million neurons, and that assumes your neural network architecture is akin to the connections between biological neurons. Considering the physical sparsity, you might need even more parameters to model the connections of a biological neural network, so fewer than a million neurons in practice.
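A back-of-envelope version of that arithmetic (the per-neuron parameter count here is purely an illustrative assumption):

    # If emulating one biological neuron takes a small network of its own,
    # how many neurons could a 70B-parameter model cover? (illustrative numbers)
    total_params = 70e9
    params_per_neuron_model = 70_000   # assumed size of the per-neuron network
    print(f"{total_params / params_per_neuron_model:,.0f} neurons")  # ~1,000,000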

imtringued
0 replies
1d10h

So what? I have seen models distributed as 26x 10GB files.

_false
0 replies
1d23h

I love both parent posts' perspectives on this.

pama
3 replies
1d14h

The big (huge?) memory requirement is during training. These LLMs work with high-dimensional vectors, they calculate gradients with respect to those vectors, and their updates require optimizer state. If you have 3 particles in 3 dimensions and you need their forces, that creates 3 new 3D vectors; once you update their positions along the forces, they also carry momenta. Now generalize this simple 3-body physics to the typical 60-layer creatures inside an LLM, with vectors of several thousand dimensions, interactions/weights that scale like the squares of those vectors, and a total parameter count that adds up to tens to hundreds of billions of parameters, and then take derivatives and keep track of momenta. It is a feat of modern engineering that some groups can train such models efficiently. I hope we will see more of the training stories become public in the near future.
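As a rough illustration of why training dwarfs inference memory, here is a minimal sketch of the usual bookkeeping (weights, gradients, and Adam optimizer moments). The byte counts assume a common mixed-precision setup with fp32 optimizer state, the parameter count is a made-up example, and activations are ignored entirely:

    def training_bytes(n_params):
        weights = 2 * n_params   # bf16 weights
        grads   = 2 * n_params   # bf16 gradients
        adam_m  = 4 * n_params   # fp32 first moment
        adam_v  = 4 * n_params   # fp32 second moment
        master  = 4 * n_params   # fp32 master copy of the weights
        return weights + grads + adam_m + adam_v + master

    def inference_bytes(n_params):
        return 2 * n_params      # bf16 weights only (KV cache not counted)

    n = 50e9  # example: a model with 50B parameters
    print(training_bytes(n) / 1e9, "GB to train, before activations")  # ~800 GB
    print(inference_bytes(n) / 1e9, "GB just to hold the weights")     # ~100 GB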

nl
2 replies
1d10h

This is wrong. You need big memory during inference too.

The difference there is you can use tricks like quantisation and offloading to CPU to reduce it somewhat at the cost of accuracy and/or speed.

pama
0 replies
16h29m

Not sure what you mean by wrong. I have yet to encounter a case where training an LLM (no matter the architecture) requires only modest memory; I was pointing out that the typical memory requirements for training are much higher than the typical requirements for inference.

brrrrrm
0 replies
1d4h

Training takes roughly 3x the memory used by inference, and is usually run at a much larger batch size.

nostrowski
1 replies
1d21h

Two things I'm curious to know:

1. How many tokens can 'traditional' models (e.g. Mistral's 8x7B) fit on a single 80GB GPU?

2. How does quantization affect the single transformer layer in the stack? What are the performance/accuracy trade-offs that happen when so little of the stack depends on this bottleneck?

patrakov
0 replies
1d21h

Mixtral 8x7b runs well (i.e., produces the correct output faster than I can read it) on a modern AMD or Intel laptop without any use of a GPU - provided that you have enough RAM and CPU cores. 32 GB of RAM and 16 hyperthreads are enough with 4-bit quantization if you don't ask too much in terms of context.

P.S. Dell Inspiron 7415 upgraded to 64 GB of RAM here.
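For reference, the arithmetic behind fitting a 4-bit Mixtral into that much RAM looks roughly like this (the parameter count and per-weight overhead are approximations):

    total_params = 47e9     # Mixtral 8x7B has roughly 47B parameters in total
    bits_per_weight = 4.5   # 4-bit quantization plus per-block scales/metadata
    weight_gb = total_params * bits_per_weight / 8 / 1e9
    print(f"~{weight_gb:.0f} GB for weights")  # ~26 GB, leaving headroom for context and the OS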

riku_iki
0 replies
1d20h

that you need 80GB of GPU memory to analyze less than 1 megabyte of data

That 80GB is all of human knowledge, compressed, being applied to that 1MB.

nl
0 replies
1d10h

That's all the world's knowledge compressed into 80GB. It's not analysing 1MB of data, it's analysing all of that knowledge plus an additional 1MB.

imtringued
0 replies
1d10h

Compared to the human brain they are shockingly efficient. It's the hardware that isn't, but that is just a matter of time.

Reubend
14 replies
2d

It's great to see a full production-level model using Mamba. But when it comes to long context window benchmarks, I'd love to see accuracy as well as throughput. I was under the impression that Mamba has huge increases in throughput at the cost of modest losses in accuracy when using long contexts.

refulgentis
10 replies
2d

I would too -- long context has been such a red herring across providers; Claude 3 is the first I've seen that seems to genuinely have some sort of qualitative leap in noticing things.

It is worth noting that I'm fairly sure there's no inherent theoretical decrease in accuracy at long contexts; the claimed theoretical change is an _increase_ in long-range accuracy at long contexts.

Arthur_ODC
5 replies
1d23h

Long context is great and all, but it sucks that all of these LLMs have really poor output length. If I feed something an entire book and ask for a comprehensive summary, then I'm expecting at least a full 3-page summary. I get that they try to force these things to be "concise" to save on compute, but good lord it's so annoying.

pedrovhb
2 replies
1d20h

Have you tried asking it for a specific, concrete length, like a number of words? I was also frustrated with concise answers when asking for long ones, but I found that the outputs improved significantly if I asked for e.g. 4000 words specifically. Beyond that, have it break the task into sections and write X words per section.
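For example, the section-by-section version of that trick can be as simple as assembling the prompt programmatically (a hypothetical sketch; section names and word budgets are placeholders):

    sections = ["Plot overview", "Major characters", "Themes", "Critical reception"]
    words_per_section = 800

    prompt = "Write a comprehensive summary of the attached book with the following sections.\n"
    for name in sections:
        prompt += f"\n## {name}\nWrite approximately {words_per_section} words for this section.\n"

    # Send `prompt` (plus the book text) to the model; per-section word budgets
    # tend to produce longer output than a single "write 3 pages" instruction.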

Arthur_ODC
1 replies
1d20h

Yes, I've tried all the length-extending custom instructions you can think of, plus multi-shot example prompts using multiple USER and GPT exchanges to define the format. I can get some reasonably long responses out of it, but I've never seen them go over one page's worth. Seems like GPT4 has a hard limit as to how much it will output when you click "continue", and Claude Opus never goes over a page either. Another user pointed out using the API, which I have done in the past, but it's been a long while, and I can't really justify the cost of using the advanced models via API for my general use.

refulgentis
0 replies
1d19h

Everyone's coalescing at a max of 4096 tokens, or 12 "pages", via API (a page being 250 words, i.e. one 8.5"x11" double-spaced sheet).

To your point, it doesn't matter anyway; it's nigh impossible to get over 2K tokens of output with every trick and bit of guidance you can think of (I got desperate trying to "make it work" when 16K/48 pages came out; even completely deforming tricks like making it number each line and write a reminder on each line that it should write 1000 lines don't work).

CuriouslyC
1 replies
1d23h

That's a ChatGPT problem; if you hit the API it's not nearly so hard to get good output.

refulgentis
0 replies
1d22h

I wouldn't say that. My latest big user story for making sure I'm handling huge inputs was "translate Moby Dick to zoomer". I can't give any service chunks larger than ~5K tokens, over API, without it failing.

(Miserably, like, I'd be fine if it gave a paragraph back. But at least on this "map" task, there's a critical point where there's so much input that the reward function ends up imitating the input more instead of chatting)

tempusalaria
2 replies
1d22h

Long context sucks in every model right now. All the model providers benchmark on fact recall, which is very limited. Actual ability to do anything complicated beyond 16k tokens is not present in any current model I have seen.

ukuina
1 replies
1d17h

This is not current. GPT-4-Turbo (128k) has lossless recall to the first 64k input tokens and produces output indistinguishable from GPT-4 (32k), though both are limited to 4k output tokens.

Several downsides: Recall accuracy past the first 64k tokens suffers badly; Cost is astronomical; Response latency is too high for most interactive use-cases.

I would point out the astounding leap in input context in just one year. Should we assume effectively-infinite (RAG-free) context in the near-future?

anoncareer0212
0 replies
1d17h

This is grossly untrue in a way that denotes surface-level familiarity on several fronts.

You're referring to the needle-in-a-haystack retrieval problem.

Which the person you're replying to explicitly mentioned is the only benchmark providers are using, for good reason.

Consider the "translate Moby Dick to comedic zoomer" problem. This does not even come remotely close to working unless I do it in maximum chunks of 5,000 tokens.

Consider the API output limit of 4096 tokens, across all providers.

And no, you shouldn't assume effectively infinite (RAG free) context in the near future. This time last year, Anthropic was demonstrating 120,000 token context. It released 200K a few weeks ago. And runtime cost scales with N^2.

binalpatel
0 replies
1d22h

Gemini 1.5 Pro is really good at long context in my experience.

samus
2 replies
1d22h

This one should have you covered :-) One out of every eight layers is a traditional Transformer layer, which should ensure precision, at least over short distances.

swyx
1 replies
1d18h

which should ensure precision, at least over short distances.

Why? I don't follow. Transformers should provide some attention over -all- distances, no? Why does layering truncate this to "short distances"?

samus
0 replies
1d17h

I mean "short" in comparison to the unlimited, but lossy recall that the Mamba blocks provide. Transformers are limited to the context length, while Mamba can carry along state. While it can remember things from a lot farther back, it is limited and must thus eventually drop things and/or lose precision.

gautamcgoel
8 replies
2d

Why include self-attention layers at all? In other words, why not just alternate SSM and MLP layers?

Rodeoclash
3 replies
1d21h

I can't remember phone numbers either, but I can use a device suited to remembering them to look them up.

orra
1 replies
1d20h

Hell, it looks like you forgot you already said that (-:

Rodeoclash
0 replies
1d18h

Haha, I blame the Harmonic app :/

imtringued
0 replies
1d10h

What if your field of vision were infinite and you were looking at an unrolled telephone book?

Would you need a device to remember the phone number? You wouldn't. You would need a method or algorithm to find the number, but there is no reason why that algorithm couldn't be part of the attention mechanism. The attention mechanism is akin to reading the entire phone book for every word you are about to say. It would be unreasonable to expect you to not find the right phone number eventually.

a_wild_dandan
1 replies
1d21h

Good! DNNs unlock semantics (parsing, transforming, producing). That's the basis of general intelligence, not encyclopedic random string recall. Models shouldn't burn ungodly quantities of compute emulating DDR5 with their working memory. We need machines that think better, not memorize well. We already have plenty of those.

Massive context windows, and their needle tests, are misguided. We won't reach human-level AGI by basically inventing a natural language RDBMS. Our resources should primarily target better reasoning systems for our models, reinforcement learning, etc.

If we can build a GPT4-level problem solving system that coincidentally also can't remember telephone numbers, I'll consider it major progress.

6gvONxR4sf7o
0 replies
1d19h

Memorization usually refers to training data. It's often useful to have something that can utilize instructions losslessly, which is the distinction between these models.

Rodeoclash
0 replies
1d21h

I can't remember phone numbers either but I can use a device suited to remembering them to look them up.

kelseyfrog
7 replies
2d

I'm glad we're seeing exploration into scaling post-transformer LLM architectures, but I'm disappointed that it has a context window. That was kind of the selling point of Mamba (and SSM models in general), right? Linear scaling, because state + input = next_state + output?

refulgentis
3 replies
2d

I'm not sure I follow fully; it is also the case for (handwaves) "traditional" LLMs that state + input = next state + output. It's just that output grows, so as output becomes input, eventually state + input / next state + output exceeds the context size.

Re: linear scaling, that means the runtime cost is O(n) in context size, rather than the traditional transformer's O(n^2).
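A toy illustration of that scaling difference (constants and per-layer details omitted; only the shape of the curves matters):

    def attention_cost(n_tokens):
        return n_tokens ** 2   # every token attends to every previous token

    def ssm_cost(n_tokens):
        return n_tokens        # a fixed-size state is updated once per token

    for n in (1_000, 10_000, 100_000):
        print(n, attention_cost(n) // ssm_cost(n))  # the gap grows linearly with n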

maccam912
1 replies
1d23h

I think kelseyfrog meant that the state for a Mamba model is supposed to "remember" stuff even if it doesn't have the actual tokens to reference any more. It might not be guaranteed to hang on to some information about tokens from a long time ago, but at least in theory it's possible, whereas tokens from before the context window in a traditional LLM may as well never have existed.

kelseyfrog
0 replies
1d22h

Yes, you said it better than I did :)

visarga
0 replies
1d23h

That is valid for Mamba, but this model (Jamba) is a mix of transformer and Mamba layers, so it still has a quadratic memory cost, just divided by 8.

a_wild_dandan
1 replies
1d21h

state = context

The difference between SSMs and GPTs here is how that state/context scales. Per usual in engineering, there are big trade-offs!

kelseyfrog
0 replies
1d21h

I'm not following. State is a multi-dimensional vector and context is a list of tokens. State is perturbed by A and Bx(t), while context is appended to by sampling the predicted token distribution.
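In code, the two update rules being contrasted look roughly like this (a minimal discretized-SSM sketch with made-up dimensions and toy values for A and B; real Mamba additionally makes these matrices input-dependent):

    import numpy as np

    # SSM view: a fixed-size state vector, updated in place for every token.
    d_state = 16
    A = np.eye(d_state) * 0.9            # state transition (toy values)
    B = np.ones(d_state)                 # input projection (toy values)
    h = np.zeros(d_state)                # the "memory" never grows
    for x_t in [0.1, 0.5, -0.2]:         # a stream of scalar inputs
        h = A @ h + B * x_t              # state is perturbed by A and B·x(t)

    # Transformer view: the context is a growing list of tokens.
    context = [101, 7592, 2088]
    next_token = 42                      # sampled from the predicted distribution
    context.append(next_token)           # memory grows with every step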

spxneo
0 replies
1d20h

256k is huge, dude. That's like half of the average non-fiction novel.

I think at least 200~300 pages of PDF.

I'm not complaining here, and it also fits on a GPU.

krasin
3 replies
2d

The license is a proper open-source one: Apache 2.0. Thanks, AI21 Labs.

spxneo
1 replies
1d20h

I'm so used to seeing AGPLv3.

Apache 2 is a more generous license.

krasin
0 replies
1d20h

AGPLv3 is a fine license too. But most of the models nowadays come with bullshit licenses, like Llama 2 with its "acceptable use policy" enforced by the license: https://ai.meta.com/llama/use-policy/

popalchemist
0 replies
1d21h

In addition to the architectural and performance benefits, this is the big deal here, IMO.

dang
1 replies
1d23h

Thanks! Macroexpanded:

Mamba Explained: The State Space Model Taking On Transformers - https://news.ycombinator.com/item?id=39501982 - Feb 2024 (93 comments)

Mamba: The Easy Way - https://news.ycombinator.com/item?id=39482428 - Feb 2024 (60 comments)

Is Mamba Capable of In-Context Learning? - https://news.ycombinator.com/item?id=39286410 - Feb 2024 (1 comment)

Vision Mamba: Efficient Visual Representation Learning with Bidirectional SSM - https://news.ycombinator.com/item?id=39214939 - Feb 2024 (16 comments)

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts - https://news.ycombinator.com/item?id=38932350 - Jan 2024 (39 comments)

Implementation of Mamba in one file of PyTorch - https://news.ycombinator.com/item?id=38708730 - Dec 2023 (109 comments)

Show HN: Fortran inference code for the Mamba state space language model - https://news.ycombinator.com/item?id=38687342 - Dec 2023 (1 comment)

Guide to the Mamba architecture that claims to be a replacement for Transformers - https://news.ycombinator.com/item?id=38659238 - Dec 2023 (2 comments)

Mamba outperforms transformers "everywhere we tried" - https://news.ycombinator.com/item?id=38606590 - Dec 2023 (25 comments)

Mamba: Linear-Time Sequence Modeling with Selective State Spaces - https://news.ycombinator.com/item?id=38522428 - Dec 2023 (37 comments)

Mamba: New SSM arch with linear-time scaling that outperforms Transformers - https://news.ycombinator.com/item?id=38520992 - Dec 2023 (2 comments)

garyiskidding
0 replies
1d13h

Thank you, these are very helpful.

google234123
2 replies
1d22h

I'm pretty sure computational chemists have been combining NNs with Kalman filters for a while now… I recall the issue was that it was slow due to the N^2 size of the covariance matrix.

uoaei
1 replies
1d22h

Surprised they hadn't found ways to advance their techniques with e.g. low-rank approximations, etc.

theGnuMe
0 replies
1d16h

That’s one strategy. Also flash attention.

cs702
2 replies
1d22h

Please link to the original post:

https://www.ai21.com/blog/announcing-jamba

Jamba looks fabulous. Good performance for its size and much more efficient than the available open alternatives.

The key idea: one out of every eight transformer blocks in Jamba applies dot-product attention with quadratic cost, while the other seven apply a Mamba layer with linear cost. And the entire model is a mixture of experts (MoE), so only ~12B parameters are used at once for inference.
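A schematic of that layout, read off from the description above (the exact block count, ordering, and which Mamba layers get MoE are assumptions for illustration, not the published configuration):

    # Hypothetical sketch of a Jamba-style hybrid stack.
    def block_type(i):
        if i % 8 == 0:
            return "attention"    # quadratic-cost dot-product attention
        if i % 2 == 0:
            return "mamba_moe"    # linear-cost Mamba layer with MoE feed-forward
        return "mamba"            # plain linear-cost Mamba layer

    stack = [block_type(i) for i in range(32)]
    print(stack.count("attention"), "attention blocks out of", len(stack))  # 4 of 32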

Thank you to the folks at AI21 for making Jamba available!

cs702
0 replies
1d18h

Mamba came out of the same research group, Hazy Research, led by Chris Ré. This new "Jamba" model incorporating Mamba and dot-product attention layers has ~8x more parameters than the largest open Striped Hyena, and appears to work much better.

moneycantbuy
1 replies
1d18h

Would a 192GB RAM Mac Studio, or even a 7950X with 192GB of RAM, be practical for running this model for inference and possibly fine-tuning? Especially if I don't need very low latency, e.g. 1 token per second is fine for inference. I also have two 3090s.

htrp
1 replies
2d

compute still has cost?

samus
0 replies
1d22h

I'm not sure I understood your question.

This model should have much lower computational cost since only one out of eight layers is a traditional transformer layer with masked self-attention. Additionally, half of the Mamba layers are MoEs.

haddr
1 replies
1d23h

Will it be possible to run this model family in Ollama?

andy99
0 replies
1d23h

Mamba is supported in llama.cpp, so it should be. (Edit: apparently it's not strictly the Mamba architecture; it's a mix of Mamba and transformer layers, so it looks like it would have to be ported to llama.cpp.)

a_wild_dandan
1 replies
2d

To those curious about the tradeoffs between transformer and state space model layers, I highly recommend Sasha Rush's video on it: https://www.youtube.com/watch?v=dKJEpOtVgXc

az226
0 replies
1d14h

They use less memory for inference but remember details less well. For instance, if you're implementing code and want edits, it will forget that various functions are part of the script. Even transformers aren't perfect at this, and SSMs are even worse. For many use cases, that ability isn't needed as much, so the memory savings are a bigger lever.

zzzzzzzzzz10
0 replies
1d10h

Where can I download and use it?

zelphirkalt
0 replies
1d17h

Is there a Sparabo too?

It is always funny to see old names associated with totally different new things!

unraveller
0 replies
1d21h

Jamba-v0.1-hybrid-MoE (16x6B?) is like giving a big NOS boost to a Mixtral-8x7B-tier LLM. If it's a true 256k context, 3x longer, faster, and cheaper than anything else, it should mean an end to the One Model To Rule Them All mindset for now. The big boys will have to offer some version of it as a separate but closely integrated sidekick to their hero offering.

toddmorey
0 replies
1d17h

Released with open weights!

sleepingreset
0 replies
1d21h

god damn

ninjahatori
0 replies
1d22h

On a side note: working over longer contexts also reminds me of MemGPT (https://github.com/cpacker/MemGPT). I think a similar concept can be applied to Mamba-architecture models too.

kjkjadksj
0 replies
1d2h

People need to pick better names. Mamba is already a popular python package and internet search tools are on their knees already.

eigenvalue
0 replies
1d22h

Has anyone gotten this to work on Linux using one or two 4090s? I get stuck on "Loading checkpoint shards: 71%" and then it bails. But weirdly, nvidia-smi shows plenty of VRAM available. My machine has 256GB of RAM, so I don't think that's the problem either. Really excited to try this one.

CGamesPlay
0 replies
1d14h

Does this mean that I can continue a chat without needing to send a full transcript? This feels like it could make inference a lot cheaper for multi-step dialogs.