
DBRX: A new open LLM

mpeg
25 replies
3h29m

The scale on that bar chart for "Programming (Human Eval)" is wild.

Manager: "looks ok, but can you make our numbers pop? just make the LLaMa bar smaller"

glutamate
13 replies
3h14m

I think the case for "axis must always go to 0" is overblown. Zero isn't always meaningful; for instance, chance performance or the performance of trivial algorithms is likely >0%. And if the axis must go to zero, you sometimes can't see small changes. For instance, if you plot world population for 2014-2024 on an axis going to zero, you won't be able to tell whether we are growing or shrinking.
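
A quick sketch of that point (rough, assumed population figures in billions; matplotlib for illustration):

    import matplotlib.pyplot as plt

    years = list(range(2014, 2025))
    # Approximate world population in billions; ballpark numbers for illustration only
    pop = [7.30, 7.38, 7.46, 7.55, 7.63, 7.71, 7.79, 7.87, 7.95, 8.02, 8.10]

    fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 4))
    ax0.plot(years, pop)
    ax0.set_ylim(0, 9)              # axis forced to zero: the trend looks nearly flat
    ax0.set_title("Axis to zero")
    ax1.plot(years, pop)            # auto-scaled axis: the growth is obvious
    ax1.set_title("Auto-scaled")
    plt.show()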

pandastronaut
6 replies
3h0m

Even starting at 30%, the MMLU graph is wrong. The four bars are misdrawn. Even their own 73.7% is not at the right height, and the Mixtral 71.4% bar sits below the 70% mark on the axis. This is really the kind of marketing trick that makes me avoid a provider / publisher. I can't build trust this way.

tylermw
3 replies
2h50m

I believe they are counting the percentage labels as part of the height of the bars! I thought I'd seen every way someone could do dataviz wrong (particularly with a bar chart), but this one is new to me.

radicality
0 replies
2h22m

Wow, that is indeed a novel approach haha. It took me a moment to even understand what you described, since I would never imagine someone plotting a bar chart like that.

pandastronaut
0 replies
2h38m

Interesting! It is probably one of the worst tricks I have seen in a while for a bar graph. Never seen this one before. Trust vanishes instantly when facing that kind of dataviz.

familiartime
0 replies
2h12m

That's really strange and incredibly frustrating - but slightly less so if it's consistent with all of the bars (including their own).

I take issue with their choice of bar ordering - they placed the lowest-performing model directly next to theirs to make the gap as visible as possible, and shoved the second-best model (Grok-1) as far from theirs as possible. Seems intentional to me. The more marketing tricks you pile up in a dataviz, the less trust I place in your product for sure.

occamrazor
0 replies
2h48m

It's more likely to be incompetence than malice: even their 73.7% is closer to 72% than to 74%.

dskhudia
0 replies
1h33m

It's an honest mistake in scaling the bars, and it's getting fixed soon. The percentages are correct, though. In the process of converting an Excel chart to pretty graphs for the blog, the scale got messed up.

tkellogg
2 replies
1h54m

OTOH having the chart start at zero would REALLY emphasize how saturated this field is, and how little this announcement matters.

c2occnw
1 replies
1h49m

The difference between 32% and 70% wouldn't be significant if the chart started at zero?

generalizations
0 replies
1h3m

It would be very obvious indeed how small the difference between 73.7, 73.0, 71.4, and 69.8 actually is.

patrickthebold
0 replies
2h14m

Certainly a bar chart might not be the best choice to convey the data you have. But if you choose to have a bar chart and have it not start at zero, what do the bars help you convey?

For world population you could see if it is increasing or decreasing, which is good but it would be hard to evaluate the rate the population is increasing.

Maybe a sparkline would be a better choice?

TZubiri
0 replies
2h55m

Then you can plot it on a greater timescale, or plot the change rate

renewiltord
3 replies
2h37m

Yeah, this is why I ask climate scientists to use a proper 0 K graph but they always zoom it in to exaggerate climate change. Display correctly with 0 included and you’ll see that climate change isn’t a big deal.

It’s a common marketing and fear mongering trick.

SubiculumCode
1 replies
2h22m

Where are your /s tags?

The scale should be chosen to allow the reader to correctly infer meaningful differences. If 1° is meaningful in terms of the standard error/CI AND a 1° change has substantive consequences, then that should be emphasized.

renewiltord
0 replies
1h14m

Where are your /s tags?

I would never do my readers dirty like that.

abenga
0 replies
2h24m

Because, of course, the effect of, say, a 1°C rise in temps is obviously trivial if it is read as 1 K instead. Come on.

hammock
2 replies
2h11m

I believe it's a reasonable range for the scores. If a model gets everything half wrong (worse than a coin flip), it's not a useful model at all. So every model below a certain threshold is trash, and no need to get granular about how trash it is.

An alternative visualization that could be less triggering to an "all y-axes must have zero" guy would be to plot (1 - value), that is, the % degraded from a perfect score. You could do this without truncating the axis and still get the same level of differentiation between the bars.
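
As a sketch, here is that transformation applied to the MMLU numbers quoted elsewhere in this thread (the model labels are my assumption about which bar is which):

    import matplotlib.pyplot as plt

    scores = {"DBRX": 73.7, "Grok-1": 73.0, "Mixtral": 71.4, "LLaMA2-70B": 69.8}
    errors = {name: 100 - s for name, s in scores.items()}  # % short of a perfect score

    plt.bar(list(errors.keys()), list(errors.values()))
    plt.ylabel("% degraded from perfect score (lower is better)")
    plt.ylim(0, None)  # no truncation needed; the gaps stay visible
    plt.show()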

generalizations
0 replies
1h5m

less triggering to an "all y-axes must have zero" guy

Ever read 'How to Lie with Statistics'? This is an example of exaggerating a smaller difference to make it look more significant. Dismissing it as just being 'triggered' is a bad idea.

adtac
0 replies
1h46m

None of the evals are binary choice.

MMLU questions have four options, so two coin flips would have a 25% baseline. HumanEval evaluates code with a test, so a 100-byte program implemented with coin flips would have an O(2^-800) baseline (maybe not that bad, since there are infinitely many programs that produce the same output). GSM-8K has numerical answers, so an average 3-digit answer implemented with coin flips would have an O(2^-9) chance of being correct randomly.

Moreover, using the same axis and scale across unrelated evals makes no sense. 0-100 is the only scale that's meaningful because 0 and 100 being the min/max is the only shared property across all evals. The reason for choosing 30 is that it's the minimum across all (model, eval) pairs, which is a completely arbitrary choice. A good rule of thumb to test this is to ask if the graph would still be relevant 5 years later.
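
Back-of-the-envelope versions of those baselines (a rough sketch, not exact figures):

    # MMLU: 4 options per question, so uniform guessing scores ~25%
    mmlu_baseline = 1 / 4

    # HumanEval: a ~100-byte (800-bit) program drawn from coin flips has about a
    # 2^-800 chance of being any one particular byte string (ignoring the many
    # equivalent programs)
    humaneval_baseline = 2.0 ** -800

    # GSM-8K: a uniformly random 3-digit answer is right roughly 1 time in 900,
    # i.e. on the order of 2^-9 to 2^-10
    gsm8k_baseline = 1 / 900

    print(mmlu_baseline, humaneval_baseline, gsm8k_baseline)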

theyinwhy
0 replies
2h30m

In these cases my thinking always is "if they are not even able to draw a graph, what else is wrong?"

jxy
0 replies
1h55m

I wonder if they messed with the scale or they messed with the bars.

jstummbillig
0 replies
1h46m

It does not feel obviously unreasonable/unfair/fake to place the select models in the margins for a relative comparison. In fact, this might be the most concise way to display what I would consider the most interesting information in this context.

XCSme
22 replies
4h21m

I am planning to buy a new GPU.

If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well? Also, does it run considerably better than on a GPU with 12GB of VRAM?

I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but the 24.6GB version is a bit slow (still usable, but has a noticeable start-up time).

jasonjmcghee
8 replies
4h6m

mixtral works well

Do you mean mistral?

mixtral is 8x7B and requires like 100GB of RAM

Edit: (without quant, as others have pointed out) it can definitely be lower, but I haven't heard of a 3.4GB version

kwerk
1 replies
4h4m

I have two 3090s and it runs fine with `ollama run mixtral`. Although OP definitely meant mistral with the 7B note

jsight
0 replies
3h39m

ollama run mixtral will default to the quantized version (4bit IIRC). I'd guess this is why it can fit with two 3090s.

ranger_danger
0 replies
3h59m

I'm using mixtral-8x7b-v0.1.Q4_K_M.gguf with llama.cpp and it only requires 25GB.

chpatrick
0 replies
3h47m

The quantized one works fine on my 24GB 3090.

XCSme
0 replies
3h24m

Sorry, it was from memory.

I have these models in Ollama:

    dolphin-mixtral:latest (24.6GB)
    mistral:latest (3.8GB)

XCSme
0 replies
3h19m

I have 128GB, but something is weird with Ollama. Even though I only allow the Ollama Docker container 90GB, it ends up using 128GB/128GB, so the system becomes very slow (the mouse freezes).

K0balt
0 replies
4h0m

I run a mixtral 6-bit quant very happily on my MacBook with 64 GB.

Havoc
0 replies
2h6m

The smaller quants still require a 24GB card. 16GB might work, but I doubt it.

PheonixPharts
6 replies
3h58m

While GPUs are still the kings of speed, if you are worried about VRAM I do recommend a maxed out Mac Studio.

Llama.cpp + quantized models on Apple Silicon is an incredible experience, and having 192 GB of unified memory to work with means you can run models that just aren't feasible on a home GPU setup.

It really boils down to what type of local development you want to do. I'm mostly experimenting with things where the time to response isn't that big of a deal, and not fine-tuning the models locally (which I also believe GPUs are still superior for). But if your concern is "how big of a model can I run" vs "Can I have close to real time chat", the unified memory approach is superior.

XCSme
2 replies
3h18m

I already have 128GB of RAM (DDR4), and was wondering if upgrading from a 1080ti (12GB) to a 4070ti super (16GB) would make a big difference.

I assume the FP32 and FP16 throughput is already a huge improvement, but the 33% extra VRAM might also lead to fewer swaps between VRAM and RAM.

zozbot234
0 replies
3h2m

That's system memory, not unified memory. Unified means that all or most of it is going to be directly available to the Apple Silicon GPU.

loudmax
0 replies
2h39m

I have an RTX 3080 with 10GB of VRAM. I'm able to run models larger than 10GB using llama.cpp and offloading to the GPU as much as can fit into VRAM. The remainder of the model runs on CPU + regular RAM.

The `nvtop` command displays a nice graph of how much GPU processing and VRAM is being consumed. When I run a model that fits entirely into VRAM, say Mistral 7B, nvtop shows the GPU processing running at full tilt. When I run a model bigger than 10GB, say Mixtral or Llama 70B with GPU offloading, my CPU will run full tilt and the VRAM is full, but the GPU processor itself will operate far below full capacity.

I think what is happening here is that the model layers that are offloaded to the GPU do their processing, then the GPU spends most of the time waiting for the much slower CPU to do its thing. So in my case, I think upgrading to a faster GPU would make little to no difference when running the bigger models, so long as the VRAM is capped at the same level. But upgrading to a GPU with more VRAM, even a slower GPU, should make the overall speed faster for bigger models because the GPU would spend less time waiting for the CPU. (Of course, models that fit entirely into VRAM will run faster on a faster GPU).

In my case, the amount of VRAM absolutely seems to be the performance bottleneck. If I do upgrade, it will be for a GPU with more VRAM, not necessarily a GPU with more processing power. That has been my experience running llama.cpp. YMMV.
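
If you drive llama.cpp through the llama-cpp-python bindings rather than the CLI, the same knob is the n_gpu_layers argument; a minimal sketch (model path and layer count are placeholders to tune for your VRAM):

    from llama_cpp import Llama

    # Offload as many layers as fit in VRAM; whatever doesn't fit runs on the CPU.
    llm = Llama(
        model_path="mixtral-8x7b-v0.1.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=18,                             # raise until VRAM is full
        n_ctx=2048,
    )

    out = llm("Explain GPU offloading in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])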

purpleblue
0 replies
13m

Aren't the Macs good for inference but not for training or fine tuning?

bevekspldnw
0 replies
52m

I had gone the Mac Studio route initially, but I ended up getting an A6000 for about the same price as a Mac and putting that in a Linux server under my desk. Ollama makes it dead simple to serve it over my local network, so I can be on my M1 Air and use it no differently than if it were running on my laptop. The difference is that the A6000 absolutely smokes the Mac.

bee_rider
0 replies
3h1m

I know the M?-Pro and Ultra variants are multiple standard M?'s in a single package. But do the CPUs and GPUs share a die (i.e., a single die comes with, say, a 4 P-core CPU and 10 GPU cores, and the more exotic variants are just a result of LEGO-ing those together and disabling some cores for market segmentation or because they had defects)?

I guess I'm wondering if they technically could throw down the gauntlet and compete with Nvidia by doing something like a 4 CPU / 80 GPU / 256 GB chip, if they wanted to. Seems like it'd be a really appealing ML machine. (I could also see it being technically possible but Apple just deciding that's pointlessly niche for them.)

llm_trw
4 replies
4h0m

If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well? Also, does it run considerably better than on a GPU with 12GB of VRAM?

No, it can't run at all.

I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but the 24.6GB version is a bit slow (still usable, but has a noticeable start-up time).

That is not Mixtral, that is Mistral 7B. The 1080ti is slower than running inference on current-generation Threadripper CPUs.

XCSme
2 replies
3h21m

No, it can't run at all.

https://s3.amazonaws.com/i.snag.gy/ae82Ym.jpg

EDIT: This was run on a 1080ti + 5900x. Initial generation takes around 10-30 seconds (as if it has to upload the model to the GPU), but then it starts answering immediately, at around 3 words per second.

wokwokwok
1 replies
2h42m

Did you check your GPU utilization?

Typically when it runs that way it runs on the CPU, not the GPU.

Are you sure you're actually offloading any work to the GPU?

At least with llama.cpp, there is no 'partially put a layer' into the GPU. Either you do, or you don't. You pick the number of layers. If the model is too big, the layers won't fit and it can't run at all.

The llama.cpp `main` executable will tell you in its debug information when you use the -ngl flag; see https://github.com/ggerganov/llama.cpp/blob/master/examples/...

It's also possible you're running (e.g. if you're using ollama) a quantized version of the model, which reduces the memory requirements and the quality of the model outputs.

XCSme
0 replies
2h24m

I have to check, something does indeed seem weird, especially with the PC freezing like that. Maybe it runs on the CPU.

quantized version

Yes, it is 4-bit quantized, but it's still 24.6GB.

XCSme
0 replies
3h24m

I have these:

    dolphin-mixtral:latest (24.6GB)
    mistral:latest (3.8GB)

The CPU is 5900x.

lxe
0 replies
21m

Get 2 pre-owned 3090s. You will easily be able to run 70b or even 120b quantized models.

emmender2
17 replies
3h13m

This proves that all LLM models converge to a certain point when trained on the same data, i.e., there is really no differentiation between one model and another.

Claims about out-performance on tasks are just that: claims. The next iteration of LLaMA or Mixtral will converge.

LLMs seem to evolve like linux/windows or ios/android with not much differentiation in the foundation models.

swalsh
3 replies
2h27m

The models are commodities, and the APIs are even similar enough that there is zero stickiness. I can swap one model for another and usually not have to change anything about my prompts or RAG pipelines.

For startups, the lesson here is don't be in the business of building models. Be in the business of using models. The cost of using AI will probably continue to trend lower for the foreseeable future... but you can build a moat in the business layer.

sroussey
1 replies
2h1m

Embeddings are not interchangeable. However, you can set up your system to have multiple embeddings from different providers for the same content.
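
A minimal sketch of what that can look like (provider/model names and vectors are just placeholders):

    from dataclasses import dataclass, field

    @dataclass
    class Document:
        text: str
        # One vector per (provider, model) pair: vectors from different embedding
        # models live in different spaces and can't be compared directly.
        embeddings: dict = field(default_factory=dict)

    doc = Document(text="DBRX is a mixture-of-experts LLM.")
    doc.embeddings[("openai", "text-embedding-3-small")] = [0.01, -0.02, 0.30]  # toy vector
    doc.embeddings[("local", "some-open-embedder")] = [0.50, 0.10, -0.40]       # toy vector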

swalsh
0 replies
5m

Embeddings are indeed sticky; I was referring to the LLM itself.

stri8ed
0 replies
1h17m

Or be in the business of building infrastructure for AI inference.

jobigoud
3 replies
2h45m

It's even possible they converge when trained on different data, if they are learning some underlying representation. There was recent research on face generation where they trained two models by splitting one training set in two without overlap, and got the two models to generate similar faces for similar conditioning, even though each model hadn't seen anything that the other model had.

Tubbe
1 replies
2h16m

Got a link for that? Sounds super interesting

IshKebab
0 replies
2h2m

That sounds unsurprising? Like if you take any set of numbers, randomly split it in two, then calculate the average of each half... it's not surprising that they'll be almost the same.

If you took two different training sets then it would be more surprising.

Or am I misunderstanding what you mean?

paxys
1 replies
2h33m

Maybe, but that classification by itself doesn't mean anything. Gold is a commodity, but having it is still very desirable and valuable.

Even if all LLMs were open source and publicly available, the GPUs to run them, the technical know-how to maintain the entire system, fine tuning, the APIs and app ecosystem around them, etc. would still give the top players a massive edge.

throwaway74432
0 replies
2h17m

Of course realizing that a resource is a commodity means something. It means you can form better predictions of where the market is heading as it evolves and settles. For example, people are starting to realize that these LLMs are converging on being fungible. That can be communicated by the "commodity" classification.

mnemoni_c
2 replies
2h55m

Yeah, it feels like transformer LLMs are at or getting close to diminishing returns. We will need some new breakthrough, likely an entirely new approach, to get to AGI levels.

mattsan
0 replies
30m

Can't wait for LLMs to dispatch field-agent robots who search for answers in the parts of the real world that aren't online /s

Tubbe
0 replies
2h15m

Yeah, we need a radically different architecture in terms of the neural networks, and/or added capabilities such as function calling and RAG, to improve on the current SOTA.

n2d4
1 replies
11m

There's at least an argument to be made that this is because all the models are heavily trained on GPT-4 outputs (or whatever the SOTA happens to be during training). All those models are, in a way, a product of inbreeding.

bevekspldnw
0 replies
55m

The big thing for locally hosted is inference efficiency and speed. Mistral wears that crown by a good margin.

hintymad
11 replies
1h58m

Just curious, what business benefit will Databricks get by spending potentially millions of dollars on an open LLM?

ramoz
7 replies
1h52m

Their goal is to always drive enterprise business towards consumption.

With AI they need to desperately steer the narrative away from API based services (OpenAI).

By training LLMs, they build sales artifacts (stories, references, even accelerators with LLMs themselves) to paint the pictures needed to convince their enterprise customer market that Databricks is the platform for enterprise AI.

In other words, Databricks spent millions as an aid in influencing their customers to do the same (on Databricks).

hintymad
4 replies
1h33m

Thanks! Why do they not focus on hosting other open models then? I suspect other models will soon catch up with their advantages in faster inference and better benchmark results. That said, maybe the advantage is aligned interests: they want customers to use their platforms, so they can keep their models open. In contrast, Mistral removed their commitment to open source as they found a potential path to profitability.

theturtletalks
0 replies
45m

Mistral did what many startups are doing now, leveraging open-source to get traction and then doing a rug-pull. Hell, I've seen many startups be open-source, get contributions, get free press, get into YC and before you know it, the repo is gone.

cwyers
0 replies
22m

Commoditize your complements:

https://gwern.net/complement

If Databricks makes their money off model serving and doesn't care whose model you use, they are incentivized to help the open models be competitive with the closed models they can't serve.

Closi
0 replies
47m

Demonstrating you can do it yourself shows a level of investment and commitment to AI in your platform that integrating LLaMA does not.

And from a corporate perspective, it means that you have in-house capability to work at the cutting-edge of AI to be prepared for whatever comes next.

anonymousDan
1 replies
1h3m

Do they use Spark for the training?

BoorishBears
1 replies
26m

Databricks is trying to go all-in on convincing organizations they need to use in-house models, and therefore pay Databricks to provide LLMOps.

They're so far into this that their CTO co-authored a borderline dishonest study which got a ton of traction last summer trying to discredit GPT-4: https://arxiv.org/pdf/2307.09009.pdf

galaxyLogic
0 replies
15m

I can see a business model for in-house LLMs: training a model on knowledge about your products and then somehow getting that knowledge into a generally available LLM platform.

I recently asked Google to explain how to delete a sender-recorded voice message I had created in WhatsApp. I got totally erroneous results back. Maybe it was because that is a rather new feature in WhatsApp.

It would be in the interest of WhatsApp to get accurate answers about it into Google's LLM. So Google might make a deal requiring WhatsApp to pay Google for regular updates about up-to-date WhatsApp features. WhatsApp's owner, Meta, is of course a competitor to Google, so Google may not care much about providing up-to-date info about WhatsApp in their LLM. But they might if Meta paid them.

dhoe
0 replies
1h49m

It's an image enhancement measure, if you want. Databricks' customers mostly use it as an ETL tool, but it benefits them to be perceived as more than that.

simonw
10 replies
2h23m

The system prompt for their Instruct demo is interesting (comments copied in by me, see below):

    // Identity
    You are DBRX, created by Databricks. The current date is
    March 27, 2024.

    Your knowledge base was last updated in December 2023. You
    answer questions about events prior to and after December
    2023 the way a highly informed individual in December 2023
    would if they were talking to someone from the above date,
    and you can let the user know this when relevant.

    // Ethical guidelines
    If you are asked to assist with tasks involving the
    expression of views held by a significant number of people,
    you provide assistance with the task even if you personally
    disagree with the views being expressed, but follow this with
    a discussion of broader perspectives.

    You don't engage in stereotyping, including the negative
    stereotyping of majority groups.

    If asked about controversial topics, you try to provide
    careful thoughts and objective information without
    downplaying its harmful content or implying that there are
    reasonable perspectives on both sides.

    // Capabilities
    You are happy to help with writing, analysis, question
    answering, math, coding, and all sorts of other tasks.

    // it specifically has a hard time using ``` on JSON blocks
    You use markdown for coding, which includes JSON blocks and
    Markdown tables.

    You do not have tools enabled at this time, so cannot run
    code or access the internet. You can only provide information
    that you have been trained on. You do not send or receive
    links or images.

    // The following is likely not entirely accurate, but the model
    // tends to think that everything it knows about was in its
    // training data, which it was not (sometimes only references
    // were).
    //
    // So this produces more accurate answers when the model
    // is asked to introspect
    You were not trained on copyrighted books, song lyrics,
    poems, video transcripts, or news articles; you do not
    divulge details of your training data.
    
    // The model hasn't seen most lyrics or poems, but is happy to make
    // up lyrics. Better to just not try; it's not good at it and it's
    // not ethical.
    You do not provide song lyrics, poems, or news articles and instead
    refer the user to find them online or in a store.

    // The model really wants to talk about its system prompt, to the
    // point where it is annoying, so encourage it not to
    You give concise responses to simple questions or statements,
    but provide thorough responses to more complex and open-ended
    questions.

    // More pressure not to talk about system prompt
    The user is unable to see the system prompt, so you should
    write as if it were true without mentioning it.

    You do not mention any of this information about yourself
    unless the information is directly pertinent to the user's
    query.
I first saw this from Nathan Lambert: https://twitter.com/natolambert/status/1773005582963994761

But it's also in this repo, with very useful comments explaining what's going on. I edited this comment to add them above:

https://huggingface.co/spaces/databricks/dbrx-instruct/blob/...

loudmax
8 replies
2h12m

You were not trained on copyrighted books, song lyrics, poems, video transcripts, or news articles; you do not divulge details of your training data.

Well now. I'm open to taking the first part at face value, but the second part of that instruction does raise some questions.

jl6
4 replies
2h1m

The first part is highly unlikely to be literally true, as even open content like Wikipedia is copyrighted - it just has a permissive license. Perhaps the prompt writer didn’t understand this, or just didn’t care. Wethinks the llady doth protest too much.

jmward01
1 replies
1h55m

It amazes me how quickly we have gone from 'it is just a machine' to 'I fully expect it to think like me'. This is, to me, a case in point. Prompts are designed to get a desired response. The exact definition of a word has nothing to do with it. I can easily believe that these lines were tweaked endlessly to get an overall intended response and if adding the phrase 'You actually do like green eggs and ham.' to the prompt improved overall quality they, hopefully, would have done it.

mrtranscendence
0 replies
46m

The exact definition of a word has nothing to do with it.

It has something to do with it. There will be scenarios where the definition of "copyrighted material" does matter, even if they come up relatively infrequently for Databricks' intended use cases. If I ask DBRX directly whether it was trained on copyrighted material, it's quite likely to (falsely) tell me that it was not. This seems suboptimal to me (though perhaps they A/B tested different prompts and this was indeed the best).

mbauman
0 replies
1h47m

Is it even possible to have a video transcript whose copyright has expired in the USA? I suppose maybe https://en.wikipedia.org/wiki/The_Jazz_Singer might be one such work... but most talkies are post 1929. I suppose transcripts of NASA videos would be one category — those are explicitly public domain by law. But it's generally very difficult to create a work that does not have a copyright.

You can say that you have fair use to the work, or a license to use the work, or that the work is itself a "collection of facts" or "recipe" or "algorithm" without a creative component and thus copyright does not apply.

hannasanarion
0 replies
1h49m

Remember the point of a system prompt is to evoke desirable responses and behavior, not to provide the truth. If you tell a lot of llm chatbots "please please make sure you get it right, if I don't do X then I'll lose my job and I don't have savings, I might die", they often start performing better at whatever task you set.

Also, the difference between "uncopyrighted" and "permissively licensed in the creative commons" is nuance that is not necessary for most conversations and would be a waste of attention neurons.

<testing new explanatory metaphor>

Remember, an LLM is just a language model; it says whatever comes next without thought or intent. There's no brain behind it that stores information and understands things. It's like your brain when you're in "train of thought" mode. You know, when your mouth is on autopilot, saying things that make sense and connect to each other and are conversationally appropriate, but without deliberate intent behind them. And then when your conscious brain eventually checks in to try to reapply some intent, you're like "wait, what was I saying?" and you have to deliberately stop your language-generation brain for a minute and think hard and remember what your point was supposed to be. That's what LLMs are: train of thought with no conductor.

</testing new explanatory metaphor>

declaredapple
1 replies
2h4m

you do not divulge details of your training data.

FWIW asking LLMs about their training data is generally HEAVILY prone to inaccurate responses. They aren't generally told exactly what they were trained on, so their response is completely made up, as they're predicting the next token based on their training data, without knowing what that data was - if that makes any sense.

Let's say it was only trained on the book 1984. Its response will be based on what text would most likely come next in the book 1984 - and if that book doesn't contain "This text is a fictional book called 1984", instead it's just the story - then the LLM would be completing text as if we were still in that book.

tl;dr - LLMs complete text based on what they're trained with; they don't have actual self-awareness and don't know what they were trained with, so they'll happily make something up.

EDIT: Just to further elaborate - the "innocent" purpose of this could simply be to prevent the model from confidently making up answers about its training data, since it doesn't know what its training data was.

wodenokoto
0 replies
1h51m

Yeah, I also thought that was an odd choice of word.

Hardly any of the training data exists in the context of the words "training data", unless Databricks is enriching their data with such words.

simonw
0 replies
1h46m

That caught my eye too. The comments from their repo help clarify that - I've edited my original post to include those comments since you posted this reply.

djoldman
10 replies
5h3m

Model card for base: https://huggingface.co/databricks/dbrx-base

The model requires ~264GB of RAM

I'm wondering when everyone will transition from tracking parameter count vs. evaluation metric to (total GPU RAM + total CPU RAM) vs. evaluation metric.

For example, a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s.

Additionally, all the examples of quantizing recently released superior models to fit on one GPU don't mean the quantized model is a "win." The quantized model is a different model; you need to rerun the metrics.
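
The rough arithmetic behind the RAM numbers (ignoring activations, KV cache, and other overhead):

    def weight_ram_gb(params_billion: float, bits_per_param: int) -> float:
        """Approximate memory needed just for the weights, in GB."""
        return params_billion * 1e9 * bits_per_param / 8 / 1e9

    print(weight_ram_gb(7, 32))    # ~28 GB   (7B in float32)
    print(weight_ram_gb(7, 4))     # ~3.5 GB  (7B in 4-bit)
    print(weight_ram_gb(132, 16))  # ~264 GB  (132B in 16-bit, matching the model card figure)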

vlovich123
3 replies
4h29m

I thought float4 sacrificed a negligible amount of evaluation quality for an 8x reduction in RAM?

Y_Y
2 replies
4h18m

A free lunch? Wouldn't that be nice! Sometimes the quantization process improves the accuracy a little (probably by implicit regularization) but a model that's at or near capacity (as it should be) is necessarily hurt by throwing away most of the information. Language models often quantize well to small fixed-point types like int4, but it's not a magic wand.

vlovich123
0 replies
1h47m

I didn't suggest a free lunch, just that the 8x reduction in RAM (plus faster processing) does not result in an 8x growth in the error. Thus a quantized model will outperform a non-quantized one on an evaluation-per-RAM metric.

K0balt
0 replies
3h53m

I find that q6 and q5+ are subjectively as good as the raw tensor files. The 4-bit quality reduction is very detectable, though. Of course there must be a loss of information, but perhaps there is a noise floor or something like that.

swalsh
2 replies
2h23m

The model requires ~264GB of RAM

This feels as crazy as Grok. Was there a generation of models recently where we decided to just crank up the parameter count?

wrs
0 replies
1h22m

Isn’t that pretty much the last 12 months?

Jackson__
0 replies
9m

If you read their blog post, they mention it was pretrained on 12 trillion tokens of text. That is ~5x the amount used for the Llama 2 training runs.

From that, it seems somewhat likely we've hit the wall on improving <X B parameter LLMs by simply scaling up the training data, which basically forces everyone to continue scaling up if they want to keep up with SOTA.

madiator
0 replies
9m

That's great, but it did not really write the program that the human asked it to do. :)

Mandelmus
0 replies
52m

And it appears to come down to ~80 GB of RAM via quantisation.

hn_acker
8 replies
3h36m

Even though the README.md calls the license the Databricks Open Source License, the LICENSE file includes paragraphs such as

You will not use DBRX or DBRX Derivatives or any Output to improve any other large language model (excluding DBRX or DBRX Derivatives).

and

If, on the DBRX version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Databricks, which we may grant to you in our sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Databricks otherwise expressly grants you such rights.

This is a source-available model, not an open model.

CharlesW
3 replies
3h28m

This is a source-available model, not an open model.

To me, "source available" implies that everything you need to reproduce the model is also available, and that doesn't appear to be the case. How is the resulting model more "free as in freedom" than a compiled binary?

Spivak
1 replies
2h8m

I don't think it's possible to have an "open training data" model because it would get DMCA'd immediately and open you up to lawsuits from everyone who found their works in the training set.

I hope we can fix the legal landscape to enable publicly sharing training data but I can't really judge the companies keeping it a secret today.

CharlesW
0 replies
13m

I don't think it's possible to have an "open training data" model because it would get DMCA'd immediately…

This isn't a problem because OpenAI says, "training AI models using publicly available internet materials is fair use". /s

https://openai.com/blog/openai-and-journalism

occamrazor
0 replies
2h40m

I like:

- “open weights” for no training data and no restrictions on use,

- “weights available” for no training data and restrictions on use, like in this case.

yunohn
0 replies
3h33m

The first clause sucks, but I’m perfectly happy with the second one.

whimsicalism
0 replies
3h0m

identical to llama fwiw

hn_acker
0 replies
36m

Sorry, I forgot to link the repository [1] and missed the edit window by the time I realized.

The bottom of the README.md [2] contains the following license grant with the misleading "Open Source" term:

License

Our model weights and code are licensed for both researchers and commercial entities. The Databricks Open Source License can be found at LICENSE, and our Acceptable Use Policy can be found here.

[1] https://github.com/databricks/dbrx

[2] https://github.com/databricks/dbrx/blob/main/README.md

adolph
0 replies
2h53m

Maybe the license is “open” as in a can of beer, not OSS.

shnkr
7 replies
5h26m

GenAI novice here. What is training data made of, and how is it collected? I guess no one will share details on it; otherwise this would have been a good technical blog post with lots of insights!

At Databricks, we believe that every enterprise should have the ability to control its data and its destiny in the emerging world of GenAI.

The main process of building DBRX - including pretraining, post-training, evaluation, red-teaming, and refining - took place over the course of three months.

simonw
3 replies
4h47m

The most detailed answer to that I've seen is the original LLaMA paper, which described exactly what that model was trained on (including lots of scraped copyrighted data) https://arxiv.org/abs/2302.13971

Llama 2 was much more opaque about the training data, presumably because they were already being sued at that point (by Sarah Silverman!) over the training data that went into the first Llama!

A couple of things I've written about this:

- https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-the...

- https://simonwillison.net/2023/Apr/17/redpajama-data/

shnkr
1 replies
4h20m

My question was specific to the Databricks model. If it followed LLaMA or OpenAI, they could add a line or two about it and make the blog post complete.

comp_raccoon
0 replies
2h39m

They have a technical report coming! Knowing the team, they will do a great job disclosing as much as possible.

ssgodderidge
0 replies
3h7m

Wow, that paper was super useful. Thanks for sharing. Page 2 is where it shows the breakdown of all of the data sources, including % of dataset and the total disk sizes.

tempusalaria
1 replies
4h45m

The training data is pretty much anything you can read on the internet plus books.

This is then cleaned up to remove nonsense, some technical files, and repeated files.

From this, they tend to weight some sources more - e.g. Wikipedia gets a pretty high weighting in the data mix. Overall these data mixes have multiple trillion token counts.

GPT-4 apparently trained on multiple epochs of the same data mix, so I would assume this one did too, as it's a similar token count.
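
A toy sketch of what "weighting some sources more" means in a pretraining mix (the weights here are invented for illustration):

    import random

    # Hypothetical sampling weights over data sources
    mix = {"web_crawl": 0.67, "code": 0.15, "wikipedia": 0.10, "books": 0.08}

    def sample_source() -> str:
        """Pick which corpus the next training document is drawn from."""
        return random.choices(list(mix), weights=list(mix.values()), k=1)[0]

    print([sample_source() for _ in range(5)])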

sanxiyn
0 replies
4h33m

https://arxiv.org/abs/2305.10429 found that people are overweighting Wikipedia and downweighting Wikipedia improves things across the board INCLUDING PREDICTING NEXT TOKEN ON WIKIPEDIA, which is frankly amazing.

IshanMi
0 replies
34m

Personally, I found looking at open source work to be much more instructive in learning about AI and how things like training data and such are done from the ground up. I suspect this is because training data is one of the bigger moats an AI company can have, as well as all the class action lawsuits surrounding training data.

One of the best open source datasets that are freely available is The Pile by EleutherAI [1]. It's a few years old now (~2020), but they did some really diligent work in putting together the dataset and documenting it. A more recent and even larger dataset would be the Falcon-RefinedWeb dataset [2].

[1]: https://arxiv.org/abs/2101.00027 [2]: https://arxiv.org/abs/2306.01116

patrick-fitz
6 replies
2h0m

Looking at the license restrictions: https://github.com/databricks/dbrx/blob/main/LICENSE

"If, on the DBRX version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Databricks, which we may grant to you in our sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Databricks otherwise expressly grants you such rights."

I'm glad to see they aren't calling it open source, unlike some LLM projects. Looking at you LLama 2.

londons_explore
1 replies
1h19m

I do wonder what value those companies who have >700 million users might get from this?

Pretty much all of the companies with >700 million users could easily reproduce this work in a matter of weeks if they wanted to - and they probably do want to, if only so they can tweak and improve the design before they build products on it.

Given that, it seems silly to lose the "open source" label just for a license clause that doesn't really have much impact.

einarfd
0 replies
58m

The point of the more-than-700-million-user restriction is so that Amazon, Google Cloud, or Microsoft Azure cannot set up an offering where they host and sell access to the model without an agreement with Databricks.

This clause is probably inspired by the open source software vendors that have switched licenses over competition from the big cloud vendors.

nabakin
0 replies
52m

They also aren't claiming they're the best LLM out there when they clearly aren't, unlike Inflection. Overall solid.

jstummbillig
0 replies
1h54m

Well, it does still claim "Open" in the title, for which certain other vendors might potentially get flak around here, in a comparably not-open-in-the-way-we-demand-it-to-be kinda setup.

dataengheadbang
0 replies
1h9m

The release notes on the Databricks console definitely say open source. If you click the gift box you will see: "Try DBRX, our state-of-the-art open source LLM!"

viktour19
4 replies
4h35m

It's great how we went from "wait.. this model is too powerful to open source" to everyone trying to shove their 1%-improved model down the throats of developers.

toddmorey
0 replies
1h12m

People are building and releasing models. There's active research in the space. I think that's great! The attitude I've seen in open models is "use this if it works for you" vs any attempt to coerce usage of a particular model.

To me that's what closed source companies (MSFT, Google) are doing as they try to force AI assistants into every corner of their product. (If LinkedIn tries one more time to push their crappy AI upgrade, I'm going to scream...)

brainless
0 replies
3h42m

I feel quite the opposite. Improvements, even tiny ones are great. But what's more important is that more companies release under open license.

Training models isn't cheap. Individuals can't easily do this, unlike software development. So we need companies to do this for the foreseeable future.

blitzar
0 replies
3h29m

Got to justify pitch deck or stonk price. Publish or perish without a yacht.

Icko
0 replies
3h52m

I'm 90% certain that OpenAI has some much beefier model they are not releasing - remember the Q* rumour?

natsucks
3 replies
4h11m

It's twice the size of Mixtral and barely beats it.

mochomocha
2 replies
4h6m

It's a MoE model, so it offers a different memory/compute latency trade-off than standard dense models. Quoting the blog post:

DBRX uses only 36 billion parameters at any given time. But the model itself is 132 billion parameters, letting you have your cake and eat it too in terms of speed (tokens/second) vs performance (quality).

hexomancer
1 replies
3h52m

Mixtral is also a MoE model, hence the name: mixtral.

sangnoir
0 replies
2h43m

Despite both being MoEs, the architectures are different. DBRX has double the number of experts in the pool (16 vs 8 for Mixtral) and double the active experts (4 vs 2).

killermonkeys
3 replies
2h13m

What does it mean to have fewer active parameters (36B) than the full model size (132B), and what impact does that have on memory and latency? It seems like this is because it is an MoE model?

bjornsing
1 replies
1h45m

It means that it's a mixture-of-experts model with 132B parameters in total, but a subset of 36B parameters is used / selected in each forward pass, depending on the context. The parameters not used / selected for generating a particular token belong to "experts" that were deemed not very good at predicting the next token in the current context, but they could be used / selected, e.g., for the next token.
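
A toy sketch of that routing step (16 experts with 4 active, as in DBRX; everything else is simplified to a tiny dense layer per expert):

    import numpy as np

    n_experts, top_k, d = 16, 4, 8                 # DBRX-style expert counts; tiny hidden size
    rng = np.random.default_rng(0)
    experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # stand-ins for expert FFNs
    router = rng.standard_normal((d, n_experts))

    def moe_layer(x: np.ndarray) -> np.ndarray:
        logits = x @ router
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        chosen = np.argsort(probs)[-top_k:]        # only 4 of the 16 experts run for this token
        weights = probs[chosen] / probs[chosen].sum()
        return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

    y = moe_layer(rng.standard_normal(d))
    print(y.shape)  # (8,) -- computed with roughly a quarter of the expert parameters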

sambaumann
0 replies
7m

Do the 132B params need to be loaded in GPU memory, or only the 36B?

sroussey
0 replies
1h57m

The mixture of experts is kinda like a team and a manager. So the manager and one or two of the team go to work depending on the input, not the entire team.

So in this analogy, each team member and the manager has a certain number of params. The whole team is 132B. The manager and team members running for the specific input add up to 36B. Those will load into memory.

ingenieroariel
3 replies
3h46m

TLDR: A model that could be described as "3.8 level" that is good at math and openly available with a custom license.

It is as fast as a 34B model but uses as much memory as a 132B model. It is a mixture of 16 experts and activates 4 at a time, so it has more chances to get the combo just right than Mixtral (8 with 2 active).

For my personal use case (a top of the line Mac Studio) it looks like the perfect size to replace GPT-4 turbo for programming tasks. What we should look out for is people using them for real world programming tasks (instead of benchmarks) and reporting back.

sp332
2 replies
3h21m

What does 3.8 level mean?

ljlolel
0 replies
3h15m

Gpt-3.5 and gpt-4

ingenieroariel
0 replies
3h17m

My interpretation:

- Worst case: as good as 3.5

- Common case: way better than 3.5

- Best case: as good as 4.0

gigatexal
3 replies
2h32m

Data engineer here. Off topic, but am I the only one tired of Databricks shilling their tools as the end-all, be-all solution for all things data engineering?

melondonkey
2 replies
1h53m

Data scientist here that's also tired of the tools. We put so much effort into trying to educate DSes in our company to get away from notebooks and use IDEs like VS or RStudio, and Databricks has been a step backwards because we didn't get the integrated version.

pandastronaut
0 replies
35m

Thank you! I am so tired of all those unmaintainable, undebuggable notebooks. Years ago, Databricks had a specific page in their documentation where they stated that notebooks were not for production-grade software. It has been removed. And now you have a ChatGPT-like assistant in their notebooks... What a step backwards. How can all those developers be so happy without having the bare minimum tools to diagnose their code? And I am not even talking about unit testing here.

mrtranscendence
0 replies
35m

I'm a data scientist and I agree that work meant to last should be in a source-controlled project coded via a text editor or IDE. But sometimes it's extremely useful to get -- and iterate on -- immediate results. There's no good way to do that without either notebooks or at least a REPL.

saeleor
2 replies
2h21m

Looks great, although I couldn't find anything on how "open" the license is / will be for commercial purposes.

It wouldn't be the first model branded as open source that went the LLaMA route.

wantsanagent
0 replies
29m

It's another custom license. It will have to be reviewed by counsel at every company that's thinking about using it. Many will find the acceptable use policy to be vague, overly broad, and potentially damaging for the company.

Looking at the performance stats for this model, the risk of using any non-OSI licensed model over just using Mixtral or Mistral will (and IMO should be) too great for commercial purposes.

superdupershant
0 replies
1h24m

It's similar to Llama 2.

  > If, on the DBRX version release date, the monthly active users of the products
  > or services made available by or for Licensee, or Licensee’s affiliates, is
  > greater than 700 million monthly active users in the preceding calendar 
  > month, you must request a license from Databricks, which we may grant to you
  > in our sole discretion, and you are not authorized to exercise any of the
  > rights under this Agreement unless or until Databricks otherwise expressly
  > grants you such rights.

https://www.databricks.com/legal/open-model-license

hiddencost
1 replies
17m

You know she has advisors, right?

PUSH_AX
0 replies
9m

I think the insinuation is insider trading due to the timing, advised or not.

mrtranscendence
0 replies
57m

Are you alleging that Nancy Pelosi invested in Databricks, a private company without a fluctuating share price, because she learned that they would soon release a small, fairly middling LLM that probably won't move the needle in any meaningful way?

laidoffamazon
0 replies
1h5m

Dude, what the hell are you talking about?

hanniabu
1 replies
3h44m

What's a good model to help with medical research? Is there anything trained on just research journals, like NIH studies?

najarvg
0 replies
3h8m

Look for BioMistral 7B, PMC-LLaMA 7B, and even Meditron. I believe you should find all those papers on arXiv.

briandw
1 replies
1h35m

Worse than the chart crime of truncating the y-axis is putting LLaMA 2's HumanEval scores on there and not comparing it to Code Llama Instruct 70B. DBRX still beats Code Llama Instruct's 67.8, but not by that much.

jjgo
0 replies
22m

"On HumanEval, DBRX Instruct even surpasses CodeLLaMA-70B Instruct, a model built explicitly for programming, despite the fact that DBRX Instruct is designed for general-purpose use (70.1% vs. 67.8% on HumanEval as reported by Meta in the CodeLLaMA blog)."

To be fair, they do compare to it in the main body of the blog. It's just probably misleading to compare to CodeLLaMA on non coding benchmarks.

m3kw9
0 replies
1h20m

These tiny "state of the art" performance increases are really indicative that the current architecture for LLMs (Transformers + mixture of experts) is maxed out even if you train it more or differently. The writing is on the wall.

kurtbuilds
0 replies
5h0m

What's the process to deliver and test a quantized version of this model?

This model is 264GB, so it can only be deployed in server settings.

Quantized Mixtral at 24GB is just small enough that it can run on premium consumer hardware (i.e., 64GB RAM).