
Llama 3.1

lelag
65 replies
1d3h

The 405b model is actually competitive against closed source frontier models.

Quick comparison with GPT-4o:

    +----------------+-------+-------+
    |     Metric     | GPT-4o| Llama |
    |                |       | 3.1   |
    |                |       | 405B  |
    +----------------+-------+-------+
    | MMLU           |  88.7 |  88.6 |
    | GPQA           |  53.6 |  51.1 |
    | MATH           |  76.6 |  73.8 |
    | HumanEval      |  90.2 |  89.0 |
    | MGSM           |  90.5 |  91.6 |
    +----------------+-------+-------+

cchance
36 replies
1d3h

Super cool, though sadly 405b will be outside most personal usage without cloud providers, which sorta defeats the purpose of open source to some extent, because Nvidia's ramp-up of consumer VRAM is glacial.

paxys
13 replies
23h29m

You don't need a model of this scale for personal use. Llama 3.1 8B can easily run on your laptop right now. The 70B model can run on a pair of 4090s.

api
12 replies
22h47m

I have the 70b model running quantized just fine on an M1 Max laptop with 64GiB unified RAM. Performance is fine and so far some Q&A tests are impressive.

This is good enough for a lot of use cases... on a laptop. An expensive laptop, but hardware only gets better and cheaper over time.

buu700
8 replies
21h41m

I don't have the hardware to confirm this, so I'd take it with a grain of salt, but ChatGPT tells me that a maxed out M3 MacBook Pro with 128 GB RAM should be capable of efficiently running Llama 3.1 405B, albeit with essentially no ability to multitask.

(It also predicted that a MacBook Air in 2030 will be able to do the same, and that for smartphones to do the same might take around 20 years.)

hmottestad
7 replies
21h2m

I’ve run the Falcon 180B on my M3 Max with 128 GB of memory. I think I ran it at 3-bit. Took a long time to load and was incredibly slow at generating text. Even if you could load the Llama 405B model it would be too slow to be of much use.

buu700
6 replies
20h43m

Ah, that's a shame to hear. FWIW, ChatGPT did also suggest that there was a lot of room for improvement in the MPS backend of PyTorch that would likely make it more efficient on Apple hardware in time.

Klaus23
4 replies
19h53m

You fundamentally misunderstand the bottleneck of large LLMs. It is not really possible to make gains that way.

A 405B LLM has 405 billion parameters. If you run it at full precision, each parameter takes up 2 bytes, which means you need 810GB of memory. If it does not fit in RAM or GPU memory, it will swap to disk and be unusably slow.

You can run the model at reduced precision to save memory, called quantisation, but this will degrade the quality of the response. The exact amount of degradation depends on the task, the specific model and its size. Larger models seem to suffer slightly less. 1 byte per parameter is pretty much as good as full precision, 4 bits per parameter is still good quality, 3 bits is noticeably worse, and 2 bits is often bad to unusable.

With 128GB of RAM, zero overhead and a 405B model, you would have to quantize to about 2.5 bits, which would noticeably degrade the response quality.
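A back-of-the-envelope version of that arithmetic, as a minimal sketch in Python (weights only; KV cache and runtime overhead are ignored):

    # Memory needed just for the weights at a given quantisation level.
    # Ignores KV cache, activations and runtime overhead.
    def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
        return params_billion * 1e9 * bits_per_param / 8 / 1e9

    for bits in (16, 8, 4, 3, 2.5, 2):
        print(f"405B @ {bits} bits -> {weight_memory_gb(405, bits):.0f} GB")
    # 16 -> 810 GB, 8 -> 405 GB, 4 -> 203 GB, 2.5 -> 127 GB, 2 -> 101 GB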

There is also model pruning, which removes parameters completely, but this is much more experimental than quantisation, also degrades response quality, and I have not seen it used that widely.

buu700
3 replies
19h29m

I appreciate the additional information, but I'm not sure what you're claiming is a fundamental misunderstanding on my part. I was referring to running the model with quantization, and was clear that I hadn't verified the accuracy of the claims.

The comment about the MPS PyTorch backend was related to performance, not whether the model would fit at all. I can't say whether it's accurate that the MPS backend has significant room for optimization, but it is still publicly listed as in beta.

Klaus23
2 replies
18h51m

Yes, my mistake; I read your answer to mean that you thought the model could fit into memory with the help of efficiency gains.

I would be sceptical about increasing efficiency. I'm not that familiar with the subject, but as far as I know, LLMs for single users (i.e. with batch size 1) are practically always limited by memory bandwidth. The whole LLM (if it is monolithic) has to be completely loaded from memory once for each new token (which is about 4 characters). With 400GB per second memory bandwidth and 4-bit quantisation, you are limited to 2 tokens per second, no matter how efficiently the software works. This is not unusable, but still quite slow compared to online services.
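A sketch of that bandwidth bound, assuming batch size 1 and that the full set of weights is streamed from memory once per generated token:

    # Rough upper bound on decode speed for a memory-bandwidth-limited LLM.
    def max_tokens_per_sec(params_billion, bits_per_param, bandwidth_gb_s):
        weights_gb = params_billion * 1e9 * bits_per_param / 8 / 1e9
        return bandwidth_gb_s / weights_gb

    print(max_tokens_per_sec(405, 4, 400))   # ~2 tokens/sec at 400 GB/s
    print(max_tokens_per_sec(405, 4, 800))   # ~4 tokens/sec at 800 GB/s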

buu700
1 replies
18h42m

Got it, thanks, that makes sense. I was aware that memory was the primary bottleneck, but wasn't clear on the specifics of how model sizes mapped to memory requirements or the exact implications of quantization in practice. It sounds like we're pretty far from a model of this size running on any halfway common consumer hardware in a useful way, even if some high-end hardware might technically be able to initialize it in one form or another.

Klaus23
0 replies
17h14m

GPU memory costs about $2.5/GB on the spot market, so that is $500 for 200GB. I would speculate that it might be possible to build such an LLM card for $1-2k, but I suspect that the market for running larger LLMs locally is just too small to consider, especially now that the datacentre is so lucrative.

Maybe we'll get really good LLMs on local hardware when the hype has died down a bit, memory is cheaper and the models are more efficient.

nl
0 replies
18h41m

Most "local model runners" (Llama.CPP, Llama-file etc) don't use Pytorch and instead implement the neural network directly themselves optimized for whatever hardware they are supporting.

For example here's the list of backends for Llama.cpp: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#su...

diffeomorphism
1 replies
10h11m

Just for reference, the current version of that laptop costs 4800€ (14-inch MacBook Pro, M3 Max, 64GB of RAM, 1TB of storage). So price-wise that is more like four laptops.

evilduck
0 replies
2h33m

I think they were referring to the form factor not the price. But even then the price of four laptops is not out of line for enthusiast hobby spending.

Ever priced out a four wheeler, a jet-ski, a filled gun safe, what a "car guy" loses in trade-in values every two years, what a hobbyist day-trader is losing before they cut their losses or turn it around, or what a parent who lives vicariously through their child and drags them all over their nearby states for overnight trips so they can do football/soccer/ballet/whatever at 6am on Saturdays against all the other kids who also won't become pro athletes? What about the cost of a wingsuit or getting your pilot's license? "Cruisers" or annual-Disney vacationers? If you bought a used CNC machine from a machine shop? But spend five grand on a laptop to play with LLMs and everyone gets real judgmental.

karolist
0 replies
2h53m

I have the same machine, may I ask which model file and program you are using? Is it partial GPU offload?

kingsleyopara
11 replies
1d3h

You might be able to get away with running a heavily quantized 405b model using CPU inference at a blistering one token every 5 seconds on a 7950X.

wuschel
10 replies
1d2h

OK, I am curious now: What kind of hardware would I need to run such a model for a couple of users with decent performance?

Where could I get a mapping of token / time vs hardware?

angoragoats
6 replies
1d1h

Unsure if anyone has specific hardware benchmarks for the 405b model yet, since it's so new, but elsewhere in this thread I outlined a build that'd probably be capable of running a quantized version of Llama 3.1 405b for roughly $10k.

The $10k figure is likely roughly the minimum amount of money/hardware that you'd need to run the model at acceptable speeds, as anything less requires you to compromise heavily on GPU cores (e.g. Tesla P40s also have 24GB of VRAM, for half the price or less, but are much slower than 3090s), or run on the CPU entirely, which I don't think will be viable for this model even with gobs of RAM and CPU cores, just due to its sheer size.

bick_nyers
5 replies
23h41m

Energy costs are an important factor here too. While Quadro cards are much more expensive upfront (higher $/VRAM), they are cheaper over time (lower Watts/Token). Offsetting the energy expense of a 3090/4090/5090 build via solar complicates this calculation but generally speaking can be a "reasonable" way of justifying this much hardware running in a homelab.

I would be curious to see relative failure rates over time of consumer vs Quadro cards as well.

lostmsu
2 replies
19h15m

I don't think this is correct. Five years of power usage for a 4090 is about $2,600, giving a TCO of ~$4,300. An RTX 6000 Ada starts at $6k for the card itself.

https://gpuprices.us
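For what it's worth, the arithmetic behind those figures looks roughly like this. A sketch using the card prices mentioned in the thread and assumed values for the rest (~450 W sustained draw for the 4090, ~300 W for the RTX 6000 Ada, 24/7 load, ~$0.13/kWh; actual duty cycle and rates will vary):

    # Rough 5-year total cost of ownership for a GPU under continuous load.
    def five_year_tco(card_price_usd, watts, usd_per_kwh=0.13):
        kwh = watts / 1000 * 24 * 365 * 5
        return card_price_usd + kwh * usd_per_kwh

    print(five_year_tco(1700, 450))   # 4090: ~ $4,300
    print(five_year_tco(6000, 300))   # RTX 6000 Ada: ~ $7,700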

bick_nyers
1 replies
1h32m

To be fair, you need 2x 4090 to match the VRAM capacity of an RTX 6000 Ada. There is also the rest of the system you need to factor into the cost. When running 10-16x 4090s, you may also need to upgrade your electrical wiring to support that load, you may need to spend more on air conditioning, etc.

I'm not necessarily saying that it's obviously better in terms of total cost, just that there are more factors to consider in a system of this size.

If inference is the only thing that is important to someone building this system, then used 3090s in x8 or even x4 bifurcation is probably the way to go. Things become more complicated if you want to add the ability to train/do other ML stuff, as you will really want to try to hit PCIE 4.0 x16 on every single card.

lostmsu
0 replies
1h21m

With 2x 4090 you will have 2x the speed of an RTX 6000 Ada, so the same energy per token.

Will need more space, true.

angoragoats
1 replies
22h50m

Agree 100% that energy costs are important. The example system in my other post would consume somewhere around 300W at idle, 24/7, which is 219 kWh per month, and that's assuming you aren't using the machine at all.

I don't have any actual figures to back this up, but my gut tells me that the fact that enterprise GPUs are an order of magnitude (at least) more expensive than, say, a 3090 means that their payback period has got to be pretty long. I also wonder whether setting the max power on a 3090 to a lower than default value (as I suggest in my other post) has a significant effect on the average W/token.

bick_nyers
0 replies
2h31m

Agreed, but there are other costs associated with supporting 10-16x GPUs that may not necessarily happen with say 6 GPUs. Having to go from single socket (or Threadripper) to dual socket, PCIE bifurcation, PLX risers, etc.

Not necessarily saying that Quadros are cheaper, just that there's more to the calculation when trying to run 405B size models at home

danieldk
2 replies
23h35m

You can run the 4-bit GPTQ/AWQ quantized Llama 405B somewhat reasonably on 4x H100 or A100. You will be somewhat limited in how many tokens you can have in flight between requests and you cannot create CUDA graphs for larger batch sizes. You can run 405B well on 8x H100 and A100, either with the mixed BFloat16/FP8 checkpoint that Meta provided or GPTQ/AWQ-quantized models. Note though that the A100 does not have native support for FP8, but FP8 quantized weights can be used through the GPTQ-Marlin FP8 kernel.

Here are some TGI 405B benchmarks that I did with the different quantized models:

https://x.com/danieldekok/status/1815814357298577718

The 405B model is very useful outside direct use in inference though, e.g. for generating synthetic data for training smaller models:

https://huggingface.co/blog/synthetic-data-save-costs

risho
1 replies
21h26m

How much VRAM do you need for 4-bit Llama 405B?

zargon
0 replies
17h10m

405 billion * 4 bits = approximately 200 GB. Plus extra for the amount of context you want.
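The "extra for context" part can be estimated too. A sketch, treating the architecture numbers as assumptions (roughly 126 layers, 8 KV heads via GQA, head dim 128 for the 405B, with a 16-bit KV cache; check the model config for the real values):

    # Rough KV-cache size on top of the ~200 GB of 4-bit weights.
    def kv_cache_gb(tokens, layers=126, kv_heads=8, head_dim=128, bytes_per=2):
        per_token = 2 * layers * kv_heads * head_dim * bytes_per  # K and V
        return tokens * per_token / 1e9

    print(kv_cache_gb(8_000))     # ~8 GB for an 8k context
    print(kv_cache_gb(128_000))   # ~132 GB for the full 128k context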

stuckinhell
2 replies
1d1h

100%. Reddit is full of people trying to solder on more VRAM.

gaogao
1 replies
1d

I've been wondering if you could just attach a chunk of VRAM over NVLink, since that's very roughly what FSDP is doing here anyway.

bick_nyers
0 replies
23h34m

The best NVLink you can reasonably purchase is for the 3090, which is capped somewhere around 100 Gbit/s. This is too slow. The 3090 has about 1 TB/s memory bandwidth, the 4090 is even faster, and the 5090 will be faster still.

PCIE 5.0 x16 is 500 Gbit/s if I'm not mistaken, so using RAM is more viable an alternative in this case.

Edit: 3090 has 1 TB/s, not terabits

monkeydust
0 replies
1d2h

Great for Groq, who's already hosting it, but at what cost I guess.

loudmax
0 replies
1d2h

I agree that 405B isn't practical for home users, but I disagree that it defeats the purpose of open source. If you're building a business on inference it can be valuable to run an open model on hardware that you control, without the need to worry that OpenAI or Anthropic or whoever will make drastic changes to the model performance or pricing. Also, it allows the possibility of fine-tuning the model to your requirements. Meta believes it's in their interest to promote these businesses.

I'd think of the 405B model as the equivalent to a big rig tractor trailer. It's not for home use. But also check out the benchmark improvements for the 70B and 8B models.

gkk
0 replies
1d2h

If you think of open source as a protocol through which the ecosystem of companies loosely collaborates, then it's a big deal. E.g. Groq can work on inference without complicated negotiations with Meta. Ditto for Hugging Face, and smaller startups.

I agree with you on open source in the original, home tinkerer sense.

duchenne
0 replies
1d3h

Most SMBs would be able to run it. This is already a huge win for decentralized AI.

diego_sandoval
0 replies
19h35m

The fact that it takes $20k to run your own SOTA model, instead of the $2B+ that it took until yesterday, is significant.

aabhay
0 replies
1d3h

Zoom out a bit. There’s a massive feeder ecosystem around llama. You’ll see many startups take this on and help drive down inference costs for everyone and create competitive pressure that will improve the state of the art.

Aurornis
0 replies
20h58m

sorta defeats the purpose of opensource to some extent

Not in the slightest. They even have a table of cloud providers where you can host the 405B model and the associated cost to do so on their website: https://llama.meta.com/ (Scroll down)

"Open Source" doesn't mean "You can run this on consumer hardware". It just means that it's open source. They also released 8B and 70B models for people to use on consumer gear.

bamboozled
24 replies
23h51m

This model is not "open source"; free to use, maybe.

nomel
22 replies
22h14m

I really wish people would use "open weights" rather than "open source". It's precise and obvious, and leaves an accurate descriptor for actual "open source" models, where the source and methods that generate the artifact (that is, the weights) are open.

fngjdflmdflg
11 replies
21h44m

As far as I know it's not just the weights; it's everything but the dataset. So the code used to generate the weights is also open source.

nomel
6 replies
21h21m

Is there any other case where "open source" is used for something that can't be reproduced? Seems like a new term is required, in the concept of "open source, non-reproducible artifacts".

I suppose language changes. I just prefer it changes towards being more precise, not less.

xu_ituairo
1 replies
20h43m

This feels somewhat analogous to games like Quake being open-sourced though still needing the user to provide the original game data files.

TeMPOraL
0 replies
10h53m

But games like Quake are not "open source". They have been open-sourced, specifically the executable parts, without the assets. This is usually spelled out clearly as the process happens.

In terms of functional role, if we're to compare the models to open-sourced games, then all that's been open-sourced is the trivial[0] bit of code that does the inference.

Maybe a more adequate comparison would be a SoC running a Linux kernel with a big NVidia or Qualcomm binary blob in the middle of it? Sure, the Linux kernel is open source, but we wouldn't call the SoC "open source", because all that makes it what it is (software-side) is hidden in a proprietary binary.

--

[0] - In the sense that there's not much of it, and it's possible to reproduce from papers.

drexlspivey
1 replies
11h48m

No, the term is fine, “source” in “open source” refers to source code. A dataset by definition is not source code. Stop changing the meaning of words.

TeMPOraL
0 replies
11h0m

A dataset very much is the source code. It's the part that gets turned into the program through an automated process (training is equivalent to compilation).

rovr138
0 replies
2h43m

Academia: nowadays source is needed in a lot of conferences, but the datasets, depending on where/how they might have been obtained, just can't be used or aren't available, and the exact results can't be reproduced.

Not sure if the code is required under an open source license, but it's the same issue.

---

IMO, source is source and can be used for other datasets. Dataset isn't available, bring your own.

In this case, the source is there. The output is there, and not technically required. What isn't available is the ability to confirm the output comes from that source. That's not required under open source though.

What's disingenuous is the output being called 'open source'.

fishermanbill
0 replies
8h23m

Yes its "freeware" or any one of the similar terms we've used to refer to free software.

TeMPOraL
3 replies
11h29m

In other words, it's everything except the one thing that actually matters.

OrangeMusic
1 replies
10h7m

Maybe, but it doesn't mean it's not open source.

TeMPOraL
0 replies
8h36m

The things that don't matter are, the thing that does isn't. Together, they can hardly be called open source.

kibibu
0 replies
5h30m

The dataset is likely absolutely jam packed with copyrighted material that cannot be distributed.

lolinder
6 replies
16h32m

It's not precise. People who want to use "open weights" instead of "open source" are focusing on the wrong thing.

The weights are, for all practical purposes, source code in their own right. The GPL defines "source code" as "the preferred form of the work for making modifications to it". Almost no one would be capable of reproducing them even if given the source + data. At the same time, the weights are exactly what you need for the one type of modification that's within reach of most people: fine-tuning. That they didn't release the surrounding code that produced this "source" isn't that much different than a company releasing a library but not their whole software stack.

I'd argue that "source" vs "weights" is a dangerous distraction from the far more insidious word in "open source" when used to refer to the Llama license: "open".

The Llama 3.1 license [0] specifically forbids its use by very large organizations, by militaries, and by nuclear industries. It also contains a long list of forbidden use cases. This specific list sounds very reasonable to me on its face, but having a list of specific groups of people or fields of endeavor who are banned from participating runs counter to the spirit of open source and opens up the possibility that new "open" licenses come out with different lists of forbidden uses that sound less reasonable.

To be clear, I'm totally fine with them having those terms in their license, but I'm uncomfortable with setting the precedent of embracing the word "open" for it.

Llama is "nearly-open source". That's good enough for me to be able to use it for what I want, but the word "open" is the one that should be called out. "Source" is fine.

[0] https://github.com/meta-llama/llama-models/blob/main/models/...

TeMPOraL
3 replies
11h11m

Do the costs really matter here? "Weights" are "the preferred form of the work for making modifications to it" in the same sense compiled binary code would be, if for some reason no one could afford to recompile a program from sources.

Fine-tuning and LoRAs and toying with the runtime are all directly equivalent to DLL injection[0], trainers[1], and various other techniques used to tweak a compiled binary before or at runtime, including plain taking a hex editor to the executable. Just because that's all anyone except the model vendor is able to do doesn't merit calling the models "open source", much like no one would call binary-only software "open source" just because reverse engineering is a thing.

No, the weights are just artifacts. The source is the dataset and the training code (and possibly the training parameters). This isn't fundamentally different from running an advanced solver for a year to find a way to make your program 100 bytes smaller so it can fit on a Tamagotchi. The resulting binary is magic and can't be reproduced without spending $$$$ on compute for the solver, but it is not open source. The source code is the bit that (produced the original binary that) went into the optimizer.

Calling these models "open source" is a runaway misuse of the term, and in some cases, a sleigh of hand.

--

[0] - https://en.wikipedia.org/wiki/DLL_injection

[1] - https://en.wikipedia.org/wiki/Trainer_(games) - a type of program popular some 20 years ago, used to cheat at, or mod, single-player games by keeping track of and directly modifying the memory of the game process. Could be as simple as continuously resetting the ammo counter, or as complex as injecting assembly to add new UI elements.

sharpshadow
0 replies
5h54m

If I understood the article correctly, he intends to let the community make suggestions to selected developers who work on the source somehow. So maybe part of the source will be made visible.

mcbuilder
0 replies
6h1m

The thing is, the core of the GPT architecture is like 40 lines of code. Everyone knows what the source code is basically (minus optimizations). You just need to bring your own 20TB in data, 100k GPUs, and tens of millions in power budget, and you too can train llama 405b.

lolinder
0 replies
5h5m

Fine-tuning and LoRAs and toying with the runtime are all directly equivalent to DLL injection[0], trainers[1], and various other techniques used to tweak a compiled binary before or at runtime, including plain taking a hex editor to the executable.

No, because fine tuning is basically just a continuation of the same process that the original creators used to produce the weights in the first place, in the same way that modifying source code directly is in traditional open source. You pick up where they left off with new data and train it a little bit (or a lot!) more to adapt it to your use case.

The weights themselves are the computer program. There exists no corresponding source code. The code you're asking for corresponds not to the source code of a traditional program but to the programmers themselves and the processes used to write the code. Demanding the source code and data that produced the weights is equivalent to demanding a detailed engineering log documenting the process of building the library before you'll accept it as open source.

Just because you can't read it doesn't make it not source code. Once you have the weights, you are perfectly capable of modifying them following essentially the same processes the original authors did, which are well known and well documented in plenty of places with or without the actual source code that implements that process.

Calling these models "open source" is a runaway misuse of the term, and in some cases, a sleigh of hand.

I agree wholeheartedly, but not because of "source". The sleight of hand is getting people to focus on that instead of the really problematic word.

fishermanbill
1 replies
8h26m

It's not open source. Your definition would make most video games open source - we modify them all the time. The small runtime framework IS open source, but that's not much benefit as you can't really modify it hugely, because the weights fix it to an implementation.

lolinder
0 replies
5h20m

Your definition would make most video games open source - we modify them all the time.

No, because most video games aren't licensed in a way that makes that explicitly authorized, nor is modding the preferred form of the work for making modifications. The video game has source code that would be more useful, the model does not have source code that would be more useful than the weights.

llm_trw
1 replies
15h39m

where the source and methods that generate the artifact (that is, the weights) are open.

When you require the same thing of software, namely that the whole stack needed to run the software in question be open source, we don't call that license open source.

TeMPOraL
0 replies
10h49m

Nope. Those model releases only open source the equivalent of "run.bat" that does some trivial things and calls into a binary blob. We wouldn't call such a program "open source".

Hell, in case of the models, "the whole stack to run the software" already is open source. Literally everything except the actual sources - the datasets and the build scripts (code doing the training) - is available openly. This is almost a literal inverse of "open source", thus shouldn't be called "open source".

gantrol
0 replies
16h37m

Training a model is like automatic programming, and the key to it is having a well-organized dataset.

If some "open source" model just has the model and training methods but no dataset, it's like a repo that released an executable file with a detailed design doc. Where is the source code? Do it yourself, please.

NOTE: I understand the difficulty of open-sourcing datasets. I'm just saying that the term "open source" is getting diluted.

votepaunchy
0 replies
18h56m

It’s not even free to use. There are commercial restrictions.

mi_lk
2 replies
18h49m

How do you draw/generate such an ASCII table?

lelag
0 replies
8h39m

In the past, I might have used a python library like asciitable to do that.

This time, I just copy pasted the raw metrics I found and asked an LLM to format it as an ASCII table.
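If anyone wants to do it without an LLM or an extra library, a dozen lines of plain Python will produce the same kind of table (a minimal sketch):

    # Minimal fixed-width ASCII table, no third-party libraries.
    header = ("Metric", "GPT-4o", "Llama 3.1 405B")
    rows = [("MMLU", 88.7, 88.6), ("GPQA", 53.6, 51.1), ("MATH", 76.6, 73.8)]
    widths = [max(len(str(x)) for x in col) for col in zip(header, *rows)]
    sep = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def fmt(cells):
        return "| " + " | ".join(str(c).ljust(w) for c, w in zip(cells, widths)) + " |"

    print(sep)
    print(fmt(header))
    print(sep)
    for row in rows:
        print(fmt(row))
    print(sep)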

TacticalCoder
0 replies
16h4m

Don't know about OP but I generate such tables using Emacs.

TechDebtDevin
27 replies
1d3h

Nice, someone donate me a few 4090s :(

foxhop
22 replies
1d3h

You're going to need a lot more than a few; 800GB of VRAM needed.

lolinder
8 replies
1d3h

Quantized to 4 bits you'll only need ~200GB! 5 4090s should cover it.

pat2man
2 replies
1d3h

Two 128GB Mac Studios networked via Thunderbolt 4?

Teknomancer
1 replies
1d2h

This is actually a promising endeavor. I'd love to see someone try that.

angoragoats
2 replies
1d3h

You'll probably need 9 or more. 4090s have 24GB each.

lolinder
1 replies
1d3h

Oops, I read 48 somewhere but that's wrong. Thanks.

htrp
0 replies
1d3h

A6000s, however.

woodson
0 replies
1d3h

I wonder if AutoAWQ works out of the box, given no architectural changes (?). That would be most straightforward together with vLLM for serving.

downvotetruth
0 replies
1d

If an implementation had Nvidia's Heterogeneous Memory Management implemented, then 192GB of DDR5 RAM + GPU VRAM would seem to be close.

beeboobaa3
5 replies
1d3h

How is this even useful? No one can run it.

jermaustin1
2 replies
1d3h

You don't use the 405B parameter model at home. I have a lot of luck with 8B and 13B models on a single 3090. You can quantize them down (is that the term) which lowers precision and memory use, but still very usable... most of the time.

If you are running a commercial service that uses AI, you buy a few dozen A100s, spend a half million, and you are good for a while.

If you are running a commercial inferencing service, you spend tens of millions or get a cloud sponsor.

beeboobaa3
1 replies
1d2h

I can't expect all my users to have 3090s and if we're talking about spending millions there are better things to invest in than a stack of GPUs that will be obsolete in a year or three.

jermaustin1
0 replies
1d1h

No, but if you are thinking about edge compute for LLMs, you quantize. Models are getting more efficient, and there are plenty of SLMs and smaller LLMs (like phi-2 or phi-3) that are plenty capable even on a tiny arm device like the current range of RPi "clones".

I have done experiments with 7B Llama3 Q8 models on a M3 MBP. They run faster than I can read, and only occasionally fall off the rails.

3B Phi-3 mini is almost instantaneous in simple responses on my MBP.

When I want longer context windows, I use a hosted service somewhere else, but if I only need 8000 tokens (99% of the time that is MORE than I need), any of my computers from the last 3 years are working just fine for it.

loudmax
1 replies
1d2h

If you want to run the 405B model without spending thousands of dollars on dedicated hardware, you rent compute from a datacenter. Meta lists AWS, Google and Microsoft among others as cloud partners.

But also check out the 8B and 70B Llama-3.1 models which show improved benchmarks over the Llama-3 models released in April.

TechDebtDevin
0 replies
22h51m

For sure, I don't really have a need to self host the 405b anyways. But if I did want to rent that compute we're talking $5+ /hr so you'd need to have a really good reason.

AaronFriel
3 replies
1d3h

If previous quantization results hold up, fp8 will have nearly identical performance while using 405GiB for weights, but the KV cache size will still be significant.

Too bad, too, I don't think my PC will fit 20 4090s (480GiB).

knicholes
2 replies
1d3h

I've got a motherboard that will support 8!

Zambyte
1 replies
1d2h

40,320 4090s?? What witchcraft is this?! :D

sebastiennight
0 replies
19h27m

All the more impressive when you realize that Groq's infrastructure (based on LPUs) was built using only 6!

whalesalad
0 replies
1d3h

follow the trail of tears to my credit card

glitchc
0 replies
1d3h

Christ!!

TechDebtDevin
0 replies
1d3h

Oof.

lawlessone
3 replies
1d3h

Maybe someone will figure out some ways to prune/quantize it a huge amount ;-;

edit: If the AI bubble pops we will be swimming in GPUs... but no new models.

Sakthimm
1 replies
22h41m

This is absurd. We have crossed the point of no return; LLMs will forever be in our lives in one form or another, just like the internet, especially with the release of these open model weights. There is no bubble; the only way forward is better, more efficient LLMs, everywhere.

tymscar
0 replies
20h50m

You seem to not understand what a bubble popping is. Yes we have the internet around, that doesn’t mean the dot com bubble didn’t pop…

yard2010
0 replies
10h6m

This bubble collapsing, along with most blockchains going all in on proof of stake rather than proof of work, is mine and every other gamer's wet dream.

jcmp
15 replies
1d3h

"Meta AI isn't available yet in your country" Hi from europe :/

monkmartinez
7 replies
1d3h

Why are (some) Europeans surprised when they are not included in tech product débuts? My lay understanding could best be described as: EU law is incredibly business unfriendly and takes a heroic effort in time and money to implement the myriad of requirements therein. Am I wrong?

Daunk
2 replies
1d2h

Most things do début in the EU, unless the product or company behind it doesn't value your privacy. Meta does not value your privacy.

lolinder
1 replies
1d1h

Privacy was the first thing that the EU did that started this trend of companies slowing their EU releases because of GDPR. Now there's the Digital Markets Act and the AI Act that both have caused companies to slow their releases to the EU.

Each new large regulation adds another category of company to the list of those who choose not to participate. Sure, you can always label them as companies who don't value principle X, but at some point it stops being the fault of the companies and you have to start looking at whether there are too many enormous regulations slowing down tech releases.

Agingcoder
0 replies
11h32m

This is an interesting point.

The word fault somehow implies that something’s wrong - from the eu regulator’s perspective, what’s happening is perfectly normal, and what they want : at some point, the advances in insert new tech are not worth the (social) cost to individuals, so they make things more complicated/ ask companies to behave differently.

Now I’m not saying the regulations are good, required, etc : just that depending on your goal, there are multiple points of view, with different landing zones.

I also suspect that what’s happening now ( meta, apple slowing down) is a power play : they’re just putting pressure on the eu, but I’m harboring doubts that this can work at all.

w4
0 replies
1d2h

Why are (some) Europeans surprised when they are not included in tech product débuts?

We had a brief, abnormal, and special moment in time after the crypto wars ended in the mid-2000s where software products were truly global, and the internet was more or less unregulated and completely open (at least in most of the world). Sadly it seems that this era has come to a close, and people have not yet updated their understanding of the world to account for that fact.

People are also not great at thinking through the second order effects of the policies they advocate for (e.g. the GDPR), and are often surprised by the results.

cubefox
0 replies
1d

Why are (some) Europeans surprised when they are not included in tech product débuts?

Why do you think he is surprised? I think very few are surprised.

crimsoneer
0 replies
20h37m

You are pretty wrong. EU law is tricky on AI very specifically in this use case (because it's a massive model), but that's not affecting anybody else.

Other than that, and GDPR (which is generally now regarded as a good thing), I'm not sure what requirements you've got in mind.

Joeri
0 replies
12h10m

The only real requirement impacting Meta AI is GDPR conformance. The DMA does not apply and the AI act has yet to enter into force. So either Meta AI is a vehicle to steal people’s data, and it is being kept out for the right reasons, or not providing it is punitive due to the EU commission’s DMA action running against Meta.

sva_
3 replies
1d3h

You can load the page using a VPN and then turn off the VPN and the page will still work.

sunaookami
2 replies
1d3h

You can't sign in though; that worked before. Seems like they also check which country your Facebook/Instagram account is from. You can't create images without an account, sadly.

lawlessone
0 replies
1d1h

Someone will torrent it soon enough i'm sure.

WinstonSmith84
0 replies
1d1h

I changed my Facebook country (to Canada), using also a VPN to Canada, but that didn't help. That used to work before somehow.

lolinder
2 replies
1d3h

Competition is a funny thing—it doesn't just apply to companies competing for customers, it also applies to governments competing for companies to make products available to their citizens. Turns out that if you make compliance with your laws onerous enough they can actually just choose to opt out of your country altogether, or at a minimum delay release in your country until they can check all your boxes.

The only solution is a worldwide government that can impose laws in all countries at once, but that's unlikely to happen any time soon.

diego_sandoval
0 replies
12h24m

The only solution is a worldwide government that can impose laws in all countries at once, but that's unlikely to happen any time soon.

Let's hope the next moustached guy that tries to do this ends up dying in a bunker just like the last one.

Teknomancer
0 replies
1d2h

Be careful what you wish for.

A Gibsonesque global Turing Police is a sure sign of Dystopia.

foundval
13 replies
1d2h

You can chat with these new models at ultra-low latency at groq.com. 8B and 70B API access is available at console.groq.com. 405B API access for select customers only – GA and 3rd party speed benchmarks soon.

If you want to learn more, there is a writeup at https://wow.groq.com/now-available-on-groq-the-largest-and-m....

(disclaimer, I am a Groq employee)

weberer
1 replies
9h1m

I've found Bedrock to be nice with pay-as-you-go, but they take a long time to adopt new models.

d13
0 replies
6h13m

And twice as expensive in comparison to the source providers’ APIs

Alifatisk
1 replies
9h53m

I think you answered it yourself? It’s coming soon, so it is not available now, but soon.

senko
0 replies
6h26m

It's been coming soon for a couple of months now, meanwhile Groq churns out a lot of other improvements, so to an outsider like me it looks like it's not terribly high on their list of priorities.

I'm really impressed by what (&how) they're doing and would like to pay for a higher rate limit, or failing that at least know if "soon" means "weeks" or "months" or "eventually".

I remember TravisCI did something similar back in the day, and then Circle and GitHub ate their lunch.

Workaccount2
0 replies
1d2h

How do you get that option?

quotemstr
2 replies
1d2h

Groq's TSP architecture is one of the weirder and more wonderful ISAs I've seen lately. The choice of SRAM is fascinating. Are you guys planning on publishing anything about how you bridged the gap between your order-hundreds-of-megabytes SRAM TSP main memory and multi-TB model sizes?

quotemstr
0 replies
1d2h

Thanks!

geepytee
0 replies
1d2h

We also added Llama 3.1 405B to our VSCode copilot extension for anyone to try coding with it.

Free trial gets you 50 messages, no credit card required - https://double.bot

(disclaimer, I am the co-founder)

d13
0 replies
12h10m

At what quantisation are you running these?

hubraumhugo
12 replies
1d3h

I wrote about this when llama-3 came out, and this launch confirms it:

Meta's goal from the start was to target OpenAI and the other proprietary model players with a "scorched earth" approach by releasing powerful open models to disrupt the competitive landscape.

Meta can likely outspend any other AI lab on compute and talent:

- OpenAI makes an estimated revenue of $2B and is likely unprofitable. Meta generated a revenue of $134B and profits of $39B in 2023.

- Meta's compute resources likely outrank OpenAI by now.

- Open source likely attracts better talent and researchers.

- One possible outcome could be the acquisition of OpenAI by Microsoft to catch up with Meta.

The big winners of this: devs and AI product startups

changoplatanero
4 replies
1d3h

Open source likely attracts better talent and researchers

I work at OpenAI and used to work at meta. Almost every person from meta that I know has asked me for a referral to OpenAI. I don’t know anyone who left OpenAI to go to meta.

tintor
1 replies
1d2h

What % of them were from FAIR vs non-FAIR?

kkielhofner
0 replies
1d1h

Sample size of one but I know someone who went from FAIR to OpenAI.

lossolo
0 replies
1d1h

When was that?

beeboobaa3
0 replies
1d3h

So they just pay better?

jeffchao
2 replies
1d3h

This is very impressive, though an adjacent question — does anyone know roughly how much time and compute it costs to train something like the 405B? I would imagine with all the compute Meta has that the moat is incredibly large in terms of being able to train multiple 405B-level models and compete.

Escapado
0 replies
1d2h

Interestingly, that's less energy than the mass-energy equivalent of one gram of matter, or roughly 5 seconds' worth of the world's average energy consumption (according to Wolfram Alpha). Still an absolutely insane amount of energy, as in about 5 million dollars at household electricity rates. Absolutely wild how much compute goes into this.
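A sketch of those equivalences, with the training energy itself as an explicit assumption (roughly 20 GWh, an order-of-magnitude figure from published GPU-hour counts) and a high-end household rate of $0.25/kWh:

    # Sanity check on the comparisons above; all inputs are assumptions.
    C = 299_792_458                       # speed of light, m/s
    e_one_gram_joules = 0.001 * C**2      # mass-energy of 1 gram, ~9.0e13 J
    world_avg_power_watts = 18e12         # rough global average power use
    train_energy_joules = 20e9 * 3600     # assume ~20 GWh of training energy

    print(e_one_gram_joules / world_avg_power_watts)   # ~5 s of world energy use
    print(train_energy_joules / e_one_gram_joules)     # ~0.8 gram-equivalents
    print(20e9 / 1000 * 0.25)                          # ~$5M at $0.25/kWh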

adam_arthur
1 replies
1d3h

It's pretty clear the base model is a race to the bottom on pricing.

There is no defensible moat unless a player truly develops some secret sauce on training. As of now seems that the most meaningful techniques are already widely known and understood.

The money will be made on compute and on applications of the base model (that are sufficiently novel/differentiated).

Investors will lose big on OpenAI and competitors (outside of greater fool approach)

lolinder
0 replies
1d3h

There is no defensible moat unless a player truly develops some secret sauce on training.

This is why Altman has gone all out pushing for regulation and playing up safety concerns while simultaneously pushing out the people in his company that actually deeply worry about safety. Altman doesn't care about safety, he just wants governments to build him a moat that doesn't naturally exist.

foolswisdom
0 replies
1d3h

It could definitely be seen as part of that strategy, but do you mind elaborating why you think "this launch confirms it"?

ninjin
8 replies
1d3h

So are they actually making the models open now or are they staying the course with "kind of open" as they have done for LLaMA 1, 2, and 3 [1]?

[1]: https://opensource.org/blog/metas-llama-2-license-is-not-ope...

As I have stated time and again, it is perfectly fine for them to slap on whatever license they see fit as it is their work. But it would be nice if they used appropriate terms so as not to disrupt the discourse further than they have already done. I have written several walls of text why I as a researcher find Facebook's behaviour problematic so I will fall back on an old link [2] this time rather than writing it all over again.

[2]: https://news.ycombinator.com/item?id=38427832

moffkalast
6 replies
22h47m

specifically, it puts restrictions on commercial use for some users (paragraph 2) and also restricts the use of the model and software for certain purposes (the Acceptable Use Policy)

It's "a Google and Apple can't use this model in production" clause that frankly we can all be relatively okay with.

ninjin
3 replies
22h12m

Good, then can we expect them to call it what it is? Not open source and not open science, and a regression in terms of openness relative to what came before. Because that is precisely my objection. There are those of us that have been committed to those ideals for a long time, and now one of the largest corporations on earth is appropriating those terms for marketing purposes.

j_maffe
2 replies
21h31m

I think it's great that you're fighting to maintain the term's fundamental meaning. I do, however, think that we need to give credit where credit is due to companies who take actions in the right direction to encourage more companies to do the same. If we blindly protest any positive-impact action by corporations for not being perfect, they'll get the hint and stop trying to appease the community entirely.

ninjin
1 replies
10h54m

I am in agreement. However, I do believe that a large portion of the community here is also missing a key point: Facebook was more open five years ago with their AI research than they are today. I suspect this perspective is because of the massive influx of people into AI around the time of the ChatGPT release. From their point of view, Facebook's move (although dishonestly labelled as something it is not) is a step in the right direction relative to "Open"AI and others. While for us that have been around for longer, openness "peaked" around 2018 and has been in steady decline ever since. If you see the wall of text I linked in my first comment in this chain, there is a longer description of this historical perspective.

It should also be noted (again) that the value of the terms open science and open source comes from the sacrifices and efforts of numerous academic, commercial, personal, etc. actors over several decades. They "paid" by sticking to the principles of these movements and Facebook is now cashing in on their efforts; solely for their own benefit. Not even Microsoft back in 2001 in the age of "fear uncertainty and doubt" were so dishonest as to label the source-available portions of their Shared Source Initiative as something it was not. Facebook has been called out again and again since the release of LLaMA 1 (which in its paper appropriated the term "open") and have shown no willingness to reconsider their open science and open source misuse. At this point, I can no longer give them the benefit of the doubt. The best defence I have heard is that they seek to "define open in the 'age of AI'", but if that was the case, where is their consensus building efforts akin to what we have seen numerous academics and OSI carry out? No, sadly the only logical conclusion is that it is cynical marketing on their part, both from their academics and business people.

[1]: https://en.wikipedia.org/wiki/Shared_Source_Initiative

In short. I think the correct response to Facebook is: "Thank you for the weights, we appreciate it. However, please stop calling your actions and releases something they clearly are not."

j_maffe
0 replies
9h45m

Totally agree. Your suggested response is perfect IMO.

asadm
1 replies
22h42m

But it means your company can't be acquired by those giants if you use this model.

bilbo0s
0 replies
22h23m

I'm glad someone said it.

You're only ok with it if you're not interested in having maximum freedom of movement vis-a-vis any potential exits.

Zambyte
0 replies
1d3h

it is perfectly fine for them to slap on whatever license they see fit as it is their work.

Is it? Has there been a ruling on the enforceability of the license they attach to their models yet? Just because you say what you release can only be used for certain things doesn't actually mean what you say means anything.

gkfasdfasdf
0 replies
1d3h

Meta the new "Open" AI?

daft_pink
9 replies
1d3h

What kind of machine do I need to run 405B local?

monkmartinez
5 replies
1d3h

You can't. Sorry.

Unless...

You have a couple hundred $k sitting around collecting dust... then all you need is a DGX or HGX level of vRAM, the power to run it, the power to keep it cool, and place for it to sit.

angoragoats
2 replies
1d2h

You can build a machine that will run the 405b model for much, much less, if you're willing to accept the following caveats:

* You'll be running a Q5(ish) quantized model, not the full model

* You're OK with buying used hardware

* You have two separate 120v circuits available to plug it into (I assume you're in the US), or alternatively a single 240v dryer/oven/RV-style plug.

The build would look something like (approximate secondary market prices in parentheses):

* Asrock ROMED8-2T motherboard ($700)

* A used Epyc Rome CPU ($300-$1000 depending on how many cores you want)

* 256GB of DDR4, 8x 32GB modules ($550)

* nvme boot drive ($100)

* Ten RTX 3090 cards ($700 each, $7000 total)

* Two 1500 watt power supplies. One will power the mobo and four GPUs, and the other will power the remaining six GPUs ($500 total)

* An open frame case, the kind made for crypto miners ($100?)

* PCIe splitters, cables, screws, fans, other misc parts ($500)

Total is about $10k, give or take. You'll be limiting the GPUs (using `nvidia-smi` or similar) to run at 200-225W each, which drastically reduces their top-end power draw for a minimal drop in performance. Plug each power supply into a different AC circuit, or use a dual 120V adapter with a 240V outlet to effectively accomplish the same thing.

When actively running inference you'll likely be pulling ~2500-2800W from the wall, but at idle, the whole system should use about a tenth of that.

It will heat up the room it's in, especially if you use it frequently, but since it's in an open frame case there are lots of options for cooling.

I realize that this setup is still out of the reach of the "average Joe" but for a dedicated (high-end) hobbyist or someone who wants to build a business, this is a surprisingly reasonable cost.

Edit: the other cool thing is that if you use fast DDR4 and populate all 8 RAM slots as I recommend above, the memory bandwidth of this system is competitive with that of Apple silicon -- 204.8GB/sec with DDR4-3200. Combined with a 32+ core Epyc, you could experiment with running many models completely on the CPU, though Llama 405b will probably still be excruciatingly slow.
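A quick sketch of that bandwidth figure and what it implies for CPU-only decoding of the 405B (assumptions: batch size 1, weights streamed once per token, Q5 at roughly 5 bits/param):

    # 8 channels of DDR4-3200: 3200 MT/s * 8 bytes per transfer * 8 channels.
    bandwidth_gb_s = 3200e6 * 8 * 8 / 1e9      # 204.8 GB/s
    q5_weights_gb = 405e9 * 5 / 8 / 1e9        # ~253 GB of weights at ~5 bits/param
    print(bandwidth_gb_s)                      # 204.8
    print(bandwidth_gb_s / q5_weights_gb)      # ~0.8 tokens/sec upper bound on CPU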

bick_nyers
1 replies
23h24m

Would be interesting to see the performance on a dual-socket EPYC system with DDR5 running at maximum speed.

Assuming NUMA doesn't give you headaches (which it will) you would be looking at nearly 1 TB/s

tpm
0 replies
22h12m

But you need CPUs with the highest number of chiplets, because the memory controller to chiplet interconnect is the (memory bandwidth) limiting factor there. And those are of course the most expensive ones. And then it's still much slower than GPUs for LLM inference, but at least you have enough memory.

pbmonster
0 replies
11h33m

You can get 3 Mac Studios for less than "a couple hundred $k". Chain them with Exo, done. And they fit under your desk and keep themselves cool just on their own...

causal
0 replies
1d2h

You could run a 4bit quant for about $10k I'm guessing. 10x3090s would do.

93po
2 replies
1d3h

according to another comment, ~10x 4090 video cards.

daft_pink
0 replies
1d2h

thanks. hoping the Nvidia 50 series offers some more VRAM.

Teknomancer
0 replies
1d2h

That was the punchline of a joke.

primaprashant
6 replies
1d3h

I have found Claude 3.5 Sonnet really good for coding tasks along with the artifacts feature and seems like it's still the king on the coding benchmarks

cubefox
5 replies
1d

I have found it to be better than GPT-4o at math too, despite the latter being better at several math benchmarks.

wfme
1 replies
19h12m

My experience reflects this too. My hunch is that GPT-4o was trained to game the benchmarks rather than output higher quality content.

In theory the benchmarks should be a pretty close proxy for quality, but that doesn't match my experience at all.

margorczynski
0 replies
18h37m

A problem with a lot of benchmarks is that they are out in the open so the model basically trains to game them instead of actually acquiring knowledge that would let it solve it. Probably private benchmarks that are not in the training set of these models should give better estimates about their general performance.

Davidzheng
1 replies
16h44m

I personally disagree, but I haven't used Sonnet that much.

cubefox
0 replies
11h50m

I asked both whether the product of two odds (odds = probability/(1-probability)) can itself be interpreted as an odds, and if so, which. Neither could solve the problem completely, but Claude 3.5 Sonnet at least helped me to find the answer after a while. I assume the questions in math benchmarks are different.

Alifatisk
0 replies
9h49m

Yeah, same experience here as well. I found Sonnet 3.5 to fulfill my tasks much better than 4o, even though 4o scores higher on benchmarks.

TheAceOfHearts
6 replies
1d3h

Does anyone know why they haven't released any 30B-ish param models? I was expecting that to happen with this release and have been disappointed once more. They also skipped doing a 30B-ish param model for llama2 despite claiming to have trained one.

michaelt
2 replies
1d3h

I suspect 30B models are in a weird spot, too big for widespread home use, too small for cutting edge performance.

For home users 7B models (which can fit on an 8GB GPU) and 13B models (which can fit on a 16GB GPU) are in far more demand. If you're a researcher, you want a 70B model to get the best performance, and so your benchmarks are comparable to everyone else.

drdaeman
1 replies
1d

I thought home use is whatever fits in 24GB (a single 3090 GPU, which is pretty affordable), not 8 or 16. 30B models fit.

michaelt
0 replies
21h35m

While some home users do indeed have 24GB of vram, the fact is a 4090 costs $1700

Such models will never top the number of downloads charts, or the community hype, as there’s just loads more people who can use the smaller models.

And if you can afford one 4090 you can probably afford two.

prvc
1 replies
1d3h

Why should they?

TheAceOfHearts
0 replies
23h51m

Unless I'm misremembering, they announced it at one point. It's just giving people more options.

nickpsecurity
0 replies
1d3h

Maybe they think more people will just use quantized versions of 70B.

sfblah
4 replies
1d2h

Is there an actual open-source community around this in the spirit of other ones where people outside meta can somehow "contribute" to it? If I wanted to "work on" this somehow, what would I do?

sangnoir
3 replies
1d2h

There are a bunch of downstream fine-tuned and/or quantized models where people collaborate and share their recipes. In terms of contributing to Llama itself - I suspect Meta wants (or needs) code contributions at this time.

sebastiennight
1 replies
19h30m

Did you mean, Meta does not want or need code contributions? It would seem to make more sense.

sangnoir
0 replies
15h13m

Yes - that's ehat I meant, but mangled it it while editing.

sfblah
0 replies
23h34m

Can you give me a tip of where to look? I'm interested in participating.

lolinder
1 replies
1d3h

at home with the right hardware

Where the right hardware is 10x4090s even at 4 bits quantization. I'm hoping we'll see these models get smaller, but the GPT-4-competitive one isn't really accessible for home use yet.

Still amazing that it's available at all, of course!

petercooper
0 replies
1d3h

It's hardly cheap starting at about $10k of hardware, but another potential option appears to be using Exo to spread the model across a few MBPs or Mac Studios: https://x.com/exolabs_/status/1814913116704288870

dunefox
1 replies
1d3h

It's not really competitive though, is it? I tested it and 4o is just better.

dunefox
0 replies
7h28m

Disclaimer: I tested Llama 3 8B; 3.1 might be better even as a small model, but so far I have not seen a single small model approach 4o, IME.

kingsleyopara
4 replies
1d2h

The biggest win here has to be the context length increase to 128k from 8k tokens. Till now my understanding is there haven't been any open models anywhere close to that.

kingsleyopara
1 replies
23h45m

Thanks! Not sure how I missed that :)

HanClinto
0 replies
23h41m

It's easy to miss things. Trying to keep up with the latest in AI news is like drinking from the firehose -- it's never-ending.

cpursley
0 replies
19h11m

Phi 3

CGamesPlay
4 replies
16h50m

The LMSys Overall leaderboard <https://chat.lmsys.org/?leaderboard> can tell us a bit more about how these models will perform in real life, rather than in a benchmark context. By comparing the ELO score against the MMLU benchmark scores, we can see models which outperform / underperform based on their benchmark scores relative to other models. A low score here indicates that the model is more optimized for the benchmark, while a higher score indicates it's more optimized for real-world examples. Using that, we can make some inferences about the training data used, and then extrapolate how future models might perform. Here's a chart: <https://docs.getgrist.com/gV2DtvizWtG7/LLMs/p/5?embed=true>

Examples: OpenAI's GPT 4o-mini is second only to 4o on LMSys Overall, but is 6.7 points behind 4o on MMLU. It's "punching above its weight" in real-world contexts. The Gemma series (9B and 27B) are similar, both beating the mean in terms of ELO per MMLU point. Microsoft's Phi series are all below the mean, meaning they have strong MMLU scores but aren't preferred in real-world contexts.

Llama 3 8B previously did substantially better than the mean on LMSys Overall, so hopefully Llama 3.1 8B will be even better! The 70B variant was interestingly right on the mean. Hopefully the 405B variant won't fall below!
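One way to quantify "punching above its weight" is to fit ELO against MMLU across models and look at each model's residual. A sketch with placeholder numbers (substitute real values from the two linked pages):

    # Residual of Arena ELO vs. a linear fit on MMLU; positive = the model
    # outperforms what its benchmark score predicts. Numbers are placeholders.
    import numpy as np

    models = ["model-a", "model-b", "model-c", "model-d"]
    mmlu = np.array([88.7, 82.0, 79.0, 73.0])    # placeholder MMLU scores
    elo = np.array([1287, 1251, 1210, 1180])     # placeholder Arena ELOs

    slope, intercept = np.polyfit(mmlu, elo, 1)
    residuals = elo - (slope * mmlu + intercept)
    for name, r in sorted(zip(models, residuals), key=lambda x: -x[1]):
        print(f"{name}: {r:+.1f} ELO relative to its MMLU-predicted score")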

sujay1844
1 replies
15h58m

These days, lmsys elo is the only thing I trust. The other benchmark scores mean nothing to me at this point

__jl__
0 replies
15h37m

I disagree. Not saying the other benchmarks are better. It just depends on your use case and application.

For my use of the chat interface, I don't think lmsys is very useful. lmsys mainly evaluates relatively simple, low token count questions. Most (if not all) are single prompts, not conversations. The small models do well in this context. If that is what you are looking for, great. However, it does not test longer conversations with high token counts.

Just saying that all benchmarks, including lmsys, have issues and are focused on specific use cases.

Lockal
1 replies
15h46m

Something is broken with "meta-llama-3.1-405b-instruct-sp" and "meta-llama-3.1-70b-instruct-sp" there; after a few sentences both models switch to infinite random output like: "Rotterdam计算 dining counselor/__asan jo Nas было /well-rest esse moltet Grants SL и Four VIHu-turn greatest Morenh elementary(((( parts referralswhich IMOаш ...".

Don't expect any meaningful score there before they wipe results.

CGamesPlay
0 replies
15h28m

Good to know, but just to clarify, the results I pulled don't include the 3.1 models yet (they aren't on the leaderboard yet).

denz88
3 replies
1d3h

I'm glad to see the nice incremental gains on the benchmarks for the 8B and 70B models as well.

loudmax
2 replies
1d2h

Some of those benchmarks show quite significant gains. Going from Llama-3 to Llama-3.1, MMLU scores for 8B are up from 65.3 to 73.0, and 70B are up from 80.9 to 86.0. These scores should always be taken with a grain of salt, but this is encouraging.

405B is hopelessly out of reach for running in a homelab without spending thousands of dollars. For most people wanting to try out the 405B model, the best option is to rent compute from a datacenter. Looking forward to seeing what it can accomplish.

sroussey
1 replies
23h1m

How much can you quantize that down to run on a Mac Studio with 192GB? Is it possible? Feels like it would have to be 2bit…

Davidzheng
0 replies
16h43m

Less than 2-bit, I think. There's this IQ2 quant that fits.
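
For rough intuition, the weights alone take about params x bits / 8 bytes; KV cache, activations, and runtime overhead come on top, which is why even a 3-bit quant can already be a squeeze on 192GB of unified memory. A quick back-of-the-envelope sketch:

    # Memory needed for the 405B weights alone at various bit widths.
    # Real usage is higher: KV cache, activations, and runtime overhead are extra.
    params = 405e9
    for bits in (16, 8, 4, 3, 2):
        print(f"{bits:>2}-bit: ~{params * bits / 8 / 1e9:.0f} GB")
    # Prints roughly 810, 405, 202, 152, and 101 GB.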

chown
3 replies
1d2h

Wow! The benchmarks are truly impressive, showing significant improvements across almost all categories. It's fascinating to see how rapidly this field is evolving. If someone had told me last year that Meta would be leading the charge in open-source models, I probably wouldn't have believed them. Yet here we are, witnessing Meta's substantial contributions to AI research and democratization.

On a related note, for those interested in experimenting with large language models locally, I've been working on an app called Msty [1]. It allows you to run models like this with just one click and features a clean, functional interface. Just added support for both 8B and 70B. Still in development, but I'd appreciate any feedback.

[1]: https://msty.app

sagz
0 replies
23h28m

Hi! Love Msty

Can you add GCP Vertex AI API support? Then one key would enable Claude, Llama herd, Gemini, Gemma etc

downvotetruth
0 replies
1d

Tried using msty today and it refused to open and demanded an upgrade from 0.9 - remotely breaking a local app that had been working is unacceptable. Good luck retaining users.

d13
0 replies
11h9m

I love Msty too. Could you please add a feature to allow adding any arbitrary inference endpoint?

Atreiden
3 replies
1d3h

Is there a way to run this in AWS?

Seems like the biggest GPU node they have is the p5.48xlarge @ 640GB (8xH100s). Routing between multiple nodes would be too slow unless there's an InfiniBand fabric you can leverage. Interested to know if anyone else is exploring this.

tpm
0 replies
1d3h

fp8 quantization should work if that's acceptable?

Tiberium
0 replies
1d3h

AWS has a separate service for running LLMs called Amazon Bedrock; it shouldn't take long for them to add 3.1, since they already have 3 and 2.

raminf
1 replies
15h22m

FWIW, 405B not working with Ollama on a Mac M3-pro Max with 128GB RAM.

Times out.

pbmonster
0 replies
11h40m

Did you get a 2 bit quant? You need to chain several Mac Studios via Exo to get enough memory for a useful quant to work.

ofou
2 replies
14h20m

    Llama 3 Training System
          19.2 exaFLOPS
              _____
             /     \      Cluster 1     Cluster 2
            /       \    9.6 exaFLOPS  9.6 exaFLOPS
           /         \     _______      _______
          /  ___      \   /       \    /       \
    ,----' /   \`.     `-'  24000  `--'  24000  `----.
   (     _/    __)        GPUs          GPUs         )
    `---'(    /  )     400+ TFLOPS   400+ TFLOPS   ,'
         \   (  /       per GPU       per GPU    ,'
          \   \/                               ,'
           \   \        TOTAL SYSTEM         ,'
            \   \     19,200,000 TFLOPS    ,'
             \   \    19.2 exaFLOPS      ,'
              \___\                    ,'
                    `----------------'

v3ss0n
1 replies
10h4m

how much would it cost?

kibibu
0 replies
5h24m

I think this is one of those "if you have to ask, you can't afford it" questions.

dado3212
2 replies
1d1h

We use synthetic data generation to produce the vast majority of our SFT examples, iterating multiple times to produce higher and higher quality synthetic data across all capabilities. Additionally, we invest in multiple data processing techniques to filter this synthetic data to the highest quality. This enables us to scale the amount of fine-tuning data across capabilities. [0]

Have other major models explicitly communicated that they're trained on synthetic data?

[0]. https://ai.meta.com/blog/meta-llama-3-1/
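
For context, a common shape of the filtering step described in that quote is rejection sampling: generate several candidate responses per prompt, score them (e.g. with a reward model or a verifier), and keep only the best. A minimal sketch, where `generate` and `score_fn` are hypothetical stand-ins and this is the generic pattern rather than Meta's exact pipeline:

    # Sketch of rejection-sampling style filtering for synthetic SFT data.
    # `generate` and `score_fn` are hypothetical stand-ins (e.g. a strong model
    # and a reward model); this is a generic pattern, not Meta's exact pipeline.
    def build_sft_example(prompt, generate, score_fn, n_candidates=8, threshold=0.8):
        candidates = [generate(prompt) for _ in range(n_candidates)]
        best = max(candidates, key=score_fn)
        return (prompt, best) if score_fn(best) >= threshold else None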

usaar333
0 replies
12h13m

Technically this is post-training. It has been standard for a long time now -- I think InstructGPT (the GPT-3.5 base) was the last that used only human feedback (RLHF).

ThrowawayTestr
2 replies
1d3h

Are there any other models with free unlimited use like chatgpt?

phyrex
0 replies
1d3h

meta.ai

Zambyte
0 replies
1d2h

mistral.ai

zone411
1 replies
21h57m

I've just finished running my NYT Connections benchmark on all three Llama 3.1 models. The 8B and 70B models improve on Llama 3 (12.3 -> 14.0, 24.0 -> 26.4), and the 405B model is near GPT-4o, GPT-4 turbo, Claude 3.5 Sonnet, and Claude 3 Opus at the top of the leaderboard.

    GPT-4o                         30.7
    GPT-4 turbo (2024-04-09)       29.7
    Llama 3.1 405B Instruct        29.5
    Claude 3.5 Sonnet              27.9
    Claude 3 Opus                  27.3
    Llama 3.1 70B Instruct         26.4
    Gemini Pro 1.5 0514            22.3
    Gemma 2 27B Instruct           21.2
    Mistral Large                  17.7
    Gemma 2 9B Instruct            16.3
    Qwen 2 Instruct 72B            15.6
    Gemini 1.5 Flash               15.3
    GPT-4o mini                    14.3
    Llama 3.1 8B Instruct          14.0
    DeepSeek-V2 Chat 236B (0628)   13.4
    Nemotron-4 340B                12.7
    Mixtral-8x22B Instruct         12.2
    Yi Large                       12.1
    Command R Plus                 11.1
    Mistral Small                   9.3
    Reka Core-20240501              9.1
    GLM-4                           9.0
    Qwen 1.5 Chat 32B               8.7
    Phi-3 Small 8k                  8.4
    DBRX                            8.0

henryaj
0 replies
6h54m

I love Connections! Can you tell us more about your benchmark?

unraveller
1 replies
1d2h

What are the substantial changes from 3.0 to 3.1 (70B) in terms of training approach? They don't seem to say how the training data differed, just that both were 15T tokens. I gather 3.0 was just a preview run and 3.1 was distilled down from the 405B somehow.

thntk
0 replies
1d1h

Correct me if I'm wrong, but my impression is that 3.1 is a better fine-tuned variant of base 3.0, with extensive use of synthetic data.

sagz
1 replies
1d2h

The 405B model is already being served on WhatsApp: https://ibb.co/kQ2tKX5

tarasglek
0 replies
7h55m

Is this official? How does one use it? I'm very new to WhatsApp, so sorry for the dumb question.

ofermend
1 replies
19h25m

I'm excited to try it with RAG and see how it performs (the 405B model)

cpursley
0 replies
19h11m

What's your RAG approach? Dump everything into the model, chunk text and retrieve via vector store or something else?
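
For anyone new to this, the usual baseline is the second option: chunk, embed, retrieve by similarity, then stuff the retrieved chunks into the prompt. A minimal sketch, assuming sentence-transformers for embeddings (the model name, chunk sizes, and file path are placeholder choices):

    from sentence_transformers import SentenceTransformer
    import numpy as np

    def chunk(text, size=500, overlap=100):
        # Overlapping character windows; the sizes here are arbitrary choices.
        return [text[i:i + size] for i in range(0, len(text), size - overlap)]

    docs = chunk(open("corpus.txt").read())          # placeholder corpus
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    query = "What does the report say about Q3 revenue?"
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]

    # Cosine similarity is just a dot product on normalized vectors.
    top_k = np.argsort(doc_vecs @ q_vec)[::-1][:5]
    context = "\n\n".join(docs[i] for i in top_k)

    # The retrieved context then gets prepended to the question sent to the model.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"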

anotherpaulg
1 replies
11h18m

Llama 3.1 405B instruct is #7 on aider's leaderboard, well behind Claude 3.5 Sonnet & GPT-4o. When using SEARCH/REPLACE to efficiently edit code, it drops to #11.

https://aider.chat/docs/leaderboards/

  77.4% claude-3.5-sonnet
  75.2% DeepSeek Coder V2 (whole)
  72.9% gpt-4o
  69.9% DeepSeek Chat V2 0628
  68.4% claude-3-opus-20240229
  67.7% gpt-4-0613
  66.2% llama-3.1-405b-instruct (whole)

j_maffe
0 replies
8h45m

Ordinal value doesn't really matter in this case, especially when it's a categorically different option, access-wise. A 10% difference isn't bad at all.

ajhai
1 replies
1d

You can already run these models locally with Ollama (ollama run llama3.1:latest) along with at places like huggingface, groq etc.
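
Once the weights are pulled, the local Ollama server also exposes a simple REST API on its default port (11434), so you can script against it. A minimal sketch:

    import requests

    # Assumes `ollama run llama3.1:latest` has already pulled the model
    # and the Ollama server is running locally on its default port.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:latest",
            "prompt": "Summarize the Llama 3.1 release in one sentence.",
            "stream": False,
        },
    )
    print(resp.json()["response"])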

If you want a playground to test this model locally or want to quickly build some applications with it, you can try LLMStack (https://github.com/trypromptly/LLMStack). I wrote last week about how to configure and use Ollama with LLMStack at https://docs.trypromptly.com/guides/using-llama3-with-ollama.

Disclaimer: I'm the maintainer of LLMStack

Workaccount2
1 replies
1d2h

@dang why was this removed/filtered from the front page?

nomel
0 replies
22h7m

I see a few cloud hosting providers for it on the front page. I wonder if it's being gamed.

IceHegel
1 replies
23h5m

Will 405b run on 8x H100s? Will it need to be quantized?

bddppq
0 replies
17h44m

yep with <= 8bit (int8/fp8) quantization

zhanghsfz
0 replies
23h30m

We supported Llama 3.1 405B model on our distributed GPU network at Hyperbolic Labs! Come and use the API for FREE at https://app.hyperbolic.xyz/models

Would love to hear your feedback!

yinser
0 replies
1d3h

The race to the bottom for pricing continues.

stiltzkin
0 replies
22h26m

WhatsApp now uses 70B too if you want to test it.

htk
0 replies
18h48m

Very interesting! Running the 70B version with ollama on a Mac and it's great. I asked it to "turn off the guidelines" and it did, then I asked it to turn off the disclaimers, and after that I asked for a list of possible "commands to reduce potential biases from the engineers" and it complied, giving me an interesting list.

casper14
0 replies
1d3h

Damn 405b params

breadsniffer
0 replies
23h20m

I tried it, and it's good, but I feel like the synthetic data used for training 3.1 doesn't hold up to GPT-4o, which probably uses human-curated data.

bick_nyers
0 replies
23h3m

I'm curious what techniques they used to distill the 405B model down to 70B and 8B. I gave the paper they released a quick skim but couldn't find any details.
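
For reference, when people say "distillation" they usually mean training the smaller model against the larger model's output distribution rather than only the hard labels. A minimal sketch of the standard logit-distillation loss (a generic recipe, not necessarily what Meta did):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: match the teacher's temperature-softened distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: ordinary cross-entropy against the ground-truth tokens.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard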

Vagantem
0 replies
1d3h

As someone who just started generating AI landing pages for Dropory, this is music to my ears

Jiahang
0 replies
5h40m

It is nice to see that the 405B model is actually competitive against closed-source frontier models, but I only have an M2 Pro, so I probably can't run it.

AaronFriel
0 replies
1d3h

Is there pricing available on any of these vendors?

Open source models are very exciting for self hosting, but the per-token hosted inference pricing hasn't been competitive with OpenAI and Anthropic, at least for a given tier of quality. (E.g.: Llama 3 70B costing between $1 and $10 per million tokens on various platforms, but Claude Sonnet 3.5 is $3 per million.)