The 405B model is actually competitive with closed-source frontier models.
Quick comparison with GPT-4o:
+-----------+--------+----------------+
| Metric    | GPT-4o | Llama 3.1 405B |
+-----------+--------+----------------+
| MMLU      | 88.7   | 88.6           |
| GPQA      | 53.6   | 51.1           |
| MATH      | 76.6   | 73.8           |
| HumanEval | 90.2   | 89.0           |
| MGSM      | 90.5   | 91.6           |
+-----------+--------+----------------+
Super cool, though sadly the 405B will be out of reach for most personal use without cloud providers, which somewhat defeats the purpose of open source, at least to some extent, because Nvidia's ramp-up of consumer VRAM is glacial.
You don't need a model of this scale for personal use. Llama 3.1 8B can easily run on your laptop right now. The 70B model can run on a pair of 4090s.
I have the 70B model running quantized just fine on an M1 Max laptop with 64 GiB of unified RAM. Performance is acceptable, and so far my Q&A tests have been impressive.
This is good enough for a lot of use cases... on a laptop. An expensive laptop, but hardware only gets better and cheaper over time.
I don't have the hardware to confirm this, so I'd take it with a grain of salt, but ChatGPT tells me that a maxed out M3 MacBook Pro with 128 GB RAM should be capable of efficiently running Llama 3.1 405B, albeit with essentially no ability to multitask.
(It also predicted that a MacBook Air in 2030 will be able to do the same, and that for smartphones to do the same might take around 20 years.)
I’ve run the Falcon 180B on my M3 Max with 128 GB of memory. I think I ran it at 3-bit. Took a long time to load and was incredibly slow at generating text. Even if you could load the Llama 405B model it would be too slow to be of much use.
Ah, that's a shame to hear. FWIW, ChatGPT did also suggest that there was a lot of room for improvement in the MPS backend of PyTorch that would likely make it more efficient on Apple hardware in time.
You fundamentally misunderstand the bottleneck of large LLMs. It is not really possible to make gains that way.
A 405B LLM has 405 billion parameters. If you run it at full precision, each parameter takes up 2 bytes, which means you need about 810 GB of memory. If it does not fit in RAM or GPU memory, it will swap to disk and be unusably slow.
You can run the model at reduced precision to save memory, which is called quantisation, but this will degrade the quality of the responses. The exact amount of degradation depends on the task, the specific model and its size; larger models seem to suffer slightly less. One byte per parameter is pretty much as good as full precision, 4 bits per parameter is still good quality, 3 bits is noticeably worse, and 2 bits is often bad to unusable.
With 128GB of RAM, zero overhead and a 405B model, you would have to quantize to about 2.5 bits, which would noticeably degrade the response quality.
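For anyone who wants to redo this arithmetic for other model sizes or memory budgets, here's a minimal back-of-the-envelope sketch (weights only; it ignores KV cache, activations and OS overhead):

    # Back-of-the-envelope memory math for a 405B-parameter model.
    # Weights only: ignores KV cache, activations and runtime overhead.
    PARAMS = 405e9

    def memory_gb(bits_per_param: float) -> float:
        """Weight memory in GB at a given quantisation level."""
        return PARAMS * bits_per_param / 8 / 1e9

    def max_bits(ram_gb: float) -> float:
        """Highest bits-per-parameter that fits in a given memory budget."""
        return ram_gb * 1e9 * 8 / PARAMS

    for bits in (16, 8, 4, 3, 2):
        print(f"{bits:>2} bits/param -> {memory_gb(bits):6.1f} GB")
    print(f"128 GB budget  -> {max_bits(128):.1f} bits/param at most")

This reproduces the numbers above: ~810 GB at 16-bit, ~202 GB at 4-bit, and roughly 2.5 bits as the ceiling for a 128 GB machine.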
There is also model pruning, which removes parameters completely, but this is much more experimental than quantisation, also degrades response quality, and I have not seen it used that widely.
I appreciate the additional information, but I'm not sure what you're claiming is a fundamental misunderstanding on my part. I was referring to running the model with quantization, and was clear that I hadn't verified the accuracy of the claims.
The comment about the MPS PyTorch backend was related to performance, not whether the model would fit at all. I can't say whether it's accurate that the MPS backend has significant room for optimization, but it is still publicly listed as in beta.
Yes, my mistake. I read your answer to mean that you thought the model could fit into memory with the help of efficiency gains.
I would be sceptical about increasing efficiency. I'm not that familiar with the subject, but as far as I know, LLMs for single users (i.e. with batch size 1) are practically always limited by memory bandwidth. The whole LLM (if it is monolithic) has to be loaded from memory once for each new token (a token is about 4 characters). With 400 GB per second of memory bandwidth and 4-bit quantisation, you are limited to about 2 tokens per second, no matter how efficient the software is. This is not unusable, but it is still quite slow compared to online services.
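To make the bandwidth limit concrete, a tiny sketch of the same arithmetic, assuming the batch-size-1 case above where all weights are streamed from memory once per token:

    # Rough upper bound on single-user (batch size 1) decode speed:
    # every new token requires reading all weights from memory once.
    PARAMS = 405e9

    def max_tokens_per_second(bandwidth_gb_s: float, bits_per_param: float) -> float:
        model_bytes = PARAMS * bits_per_param / 8
        return bandwidth_gb_s * 1e9 / model_bytes

    # 400 GB/s of memory bandwidth at 4-bit quantisation:
    print(f"{max_tokens_per_second(400, 4):.1f} tokens/s upper bound")   # ~2.0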
Got it, thanks, that makes sense. I was aware that memory was the primary bottleneck, but wasn't clear on the specifics of how model sizes mapped to memory requirements or the exact implications of quantization in practice. It sounds like we're pretty far from a model of this size running on any halfway common consumer hardware in a useful way, even if some high-end hardware might technically be able to initialize it in one form or another.
GPU memory costs about $2.5/GB on the spot market, so that is $500 for 200 GB. I would speculate that it might be possible to build such an LLM card for $1-2k, but I suspect that the market for running larger LLMs locally is just too small to consider, especially now that the datacentre is so lucrative.
Maybe we'll get really good LLMs on local hardware when the hype has died down a bit, memory is cheaper and the models are more efficient.
Most "local model runners" (Llama.CPP, Llama-file etc) don't use Pytorch and instead implement the neural network directly themselves optimized for whatever hardware they are supporting.
For example here's the list of backends for Llama.cpp: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#su...
Just for reference, the current version of that laptop costs 4800€ (14 inch macbook pro, m3 max, 64gb of ram, 1TB of storage). So price-wise that is more like four laptops.
I think they were referring to the form factor not the price. But even then the price of four laptops is not out of line for enthusiast hobby spending.
Ever priced out a four wheeler, a jet-ski, a filled gun safe, what a "car guy" loses in trade-in values every two years, what a hobbyist day-trader is losing before they cut their losses or turn it around, or what a parent who lives vicariously through their child and drags them all over their nearby states for overnight trips so they can do football/soccer/ballet/whatever at 6am on Saturdays against all the other kids who also won't become pro athletes? What about the cost of a wingsuit or getting your pilot's license? "Cruisers" or annual Disney vacationers? What if you bought a used CNC machine from a machine shop? But spend five grand on a laptop to play with LLMs and everyone gets real judgmental.
I have the same machine. May I ask which model file and program you're using? Is it partial GPU offload?
You might be able to get away with running a heavily quantized 405B model using CPU inference, at a blistering fast token every 5 seconds on a 7950X.
OK, I am curious now: What kind of hardware would I need to run such a model for a couple of users with decent performance?
Where could I get a mapping of token / time vs hardware?
Unsure if anyone has specific hardware benchmarks for the 405b model yet, since it's so new, but elsewhere in this thread I outlined a build that'd probably be capable of running a quantized version of Llama 3.1 405b for roughly $10k.
The $10k figure is likely roughly the minimum amount of money/hardware that you'd need to run the model at acceptable speeds, as anything less requires you to compromise heavily on GPU cores (e.g. Tesla P40s also have 24GB of VRAM, for half the price or less, but are much slower than 3090s), or run on the CPU entirely, which I don't think will be viable for this model even with gobs of RAM and CPU cores, just due to its sheer size.
Energy costs are an important factor here too. While Quadro cards are much more expensive upfront (higher $/VRAM), they are cheaper over time (lower Watts/Token). Offsetting the energy expense of a 3090/4090/5090 build via solar complicates this calculation but generally speaking can be a "reasonable" way of justifying this much hardware running in a homelab.
I would be curious to see relative failure rates over time of consumer vs Quadro cards as well.
I don't think this is correct. 5 years power usage of 4090 is $2600 giving TCO of ~$4300. RTX 6000 Ada starts at $6k for the card itself.
https://gpuprices.us
To be fair, you need 2x 4090 to match the VRAM capacity of an RTX 6000 Ada. There is also the rest of the system you need to factor into the cost. When running 10-16x 4090s, you may also need to upgrade your electrical wiring to support that load, you may need to spend more on air conditioning, etc.
I'm not necessarily saying that it's obviously better in terms of total cost, just that there are more factors to consider in a system of this size.
If inference is the only thing that is important to someone building this system, then used 3090s in x8 or even x4 bifurcation is probably the way to go. Things become more complicated if you want to add the ability to train/do other ML stuff, as you will really want to try to hit PCIE 4.0 x16 on every single card.
With 2x 4090 you will have 2x the speed of an RTX 6000 Ada, so the same energy per token.
Will need more space, true.
Agree 100% that energy costs are important. The example system in my other post would consume somewhere around 300W at idle, 24/7, which is 219 kWh per month, and that's assuming you aren't using the machine at all.
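For a rough sense of what that idle draw costs, a quick sketch; the electricity price is an assumption, so plug in your own rate:

    # Idle energy cost of an always-on multi-GPU box.
    idle_watts = 300
    hours_per_month = 24 * 365 / 12              # ~730 h
    kwh_per_month = idle_watts * hours_per_month / 1000
    price_per_kwh = 0.15                         # assumed $/kWh; adjust for your utility
    print(f"{kwh_per_month:.0f} kWh/month -> ${kwh_per_month * price_per_kwh:.0f}/month at idle")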
I don't have any actual figures to back this up, but my gut tells me that the fact that enterprise GPUs are (at least) an order of magnitude more expensive than, say, a 3090 means that their payback period has got to be pretty long. I also wonder whether setting the max power of a 3090 to a lower-than-default value (as I suggest in my other post) has a significant effect on the average W/token.
Agreed, but there are other costs associated with supporting 10-16x GPUs that may not necessarily happen with say 6 GPUs. Having to go from single socket (or Threadripper) to dual socket, PCIE bifurcation, PLX risers, etc.
Not necessarily saying that Quadros are cheaper, just that there's more to the calculation when trying to run 405B size models at home
You can run the 4-bit GPTQ/AWQ-quantized Llama 405B somewhat reasonably on 4x H100 or A100. You will be somewhat limited in how many tokens you can have in flight across requests, and you cannot create CUDA graphs for larger batch sizes. You can run 405B well on 8x H100 or A100, either with the mixed BFloat16/FP8 checkpoint that Meta provided or with GPTQ/AWQ-quantized models. Note though that the A100 does not have native support for FP8, but FP8-quantized weights can be used through the GPTQ-Marlin FP8 kernel.
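As a back-of-the-envelope check on why 4x is tight and 8x is comfortable (weights only; KV cache, CUDA graphs and activations come on top):

    # Weight memory per GPU when sharding a 4-bit 405B model.
    PARAMS = 405e9
    weight_gb = PARAMS * 4 / 8 / 1e9             # ~202.5 GB of 4-bit weights

    for num_gpus in (4, 8):
        per_gpu = weight_gb / num_gpus
        print(f"{num_gpus} x 80 GB GPUs: {per_gpu:.0f} GB weights/GPU, "
              f"{80 - per_gpu:.0f} GB left for KV cache and overhead")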
Here are some TGI 405B benchmarks that I did with the different quantized models:
https://x.com/danieldekok/status/1815814357298577718
The 405B model is very useful beyond direct inference though, e.g. for generating synthetic data to train smaller models:
https://huggingface.co/blog/synthetic-data-save-costs
How much VRAM do you need for 4-bit Llama 405B?
405 billion * 4 bits = approximately 200 GB. Plus extra for the amount of context you want.
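The "extra for context" is mostly the KV cache. A rough sketch below; the layer/head numbers are what I believe the published 405B config uses (GQA with 8 KV heads, head dim 128, 126 layers), so verify them against the model card, but the formula is the useful part:

    # Rough KV-cache size estimate for Llama 3.1 405B.
    # Config values below are assumptions; double-check against the model card.
    layers, kv_heads, head_dim = 126, 8, 128
    bytes_per_elem = 2                            # fp16/bf16 cache

    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    print(f"{kv_bytes_per_token / 1e6:.1f} MB per token of context")
    print(f"{kv_bytes_per_token * 32_000 / 1e9:.0f} GB for a 32k-token context")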
100%. Reddit is full of people trying to solder more VRAM.
I've been wondering if you could just attach a chunk of vram over NVLink, since that's very roughly what FSDP is doing here anyways.
The best NVLink you can reasonably purchase is for the 3090, which is capped somewhere around 100 Gbit/s. This is too slow. The 3090 has about 1 TB/s of memory bandwidth, the 4090 is even faster, and the 5090 will be faster still.
PCIE 5.0 x16 is 500 Gbit/s if I'm not mistaken, so using RAM is more viable an alternative in this case.
Edit: 3090 has 1 TB/s, not terabits
Great for Groq, who's already hosting it, but at what cost, I guess.
I agree that 405B isn't practical for home users, but I disagree that it defeats the purpose of open source. If you're building a business on inference it can be valuable to run an open model on hardware that you control, without the need to worry that OpenAI or Anthropic or whoever will make drastic changes to the model performance or pricing. Also, it allows the possibility of fine-tuning the model to your requirements. Meta believes it's in their interest to promote these businesses.
I'd think of the 405B model as the equivalent of a big-rig tractor trailer. It's not for home use. But also check out the benchmark improvements for the 70B and 8B models.
If you think of open source as a protocol through which an ecosystem of companies loosely collaborates, then it's a big deal. E.g. Groq can work on inference without complicated negotiations with Meta. Ditto for Hugging Face, and for smaller startups.
I agree with you on open source in the original, home tinkerer sense.
Most SMBs would be able to run it. This is already a huge win for decentralized AI.
The fact that it takes $20k to run your own SOTA model, instead of the $2B+ that it took until yesterday, is significant.
Zoom out a bit. There’s a massive feeder ecosystem around llama. You’ll see many startups take this on and help drive down inference costs for everyone and create competitive pressure that will improve the state of the art.
Not in the slightest. They even have a table of cloud providers where you can host the 405B model and the associated cost to do so on their website: https://llama.meta.com/ (Scroll down)
"Open Source" doesn't mean "You can run this on consumer hardware". It just means that it's open source. They also released 8B and 70B models for people to use on consumer gear.
This model is not "open source"; free to use, maybe.
I really wish people would use "open weights" rather than "open source". It's precise and obvious, and it leaves an accurate descriptor for actual "open source" models, where the source and methods that generate the artifact (that is, the weights) are open.
As far as I know it's not just the weights; it's everything but the dataset. So the code used to generate the weights is also open source.
Is there any other case where "open source" is used for something that can't be reproduced? It seems like a new term is required, along the lines of "open source, non-reproducible artifacts".
I suppose language changes. I just prefer it changes towards being more precise, not less.
This feels somewhat analogous to games like Quake being open-sourced though still needing the user to provide the original game data files.
But games like Quake are not "open source". They have been open-sourced, specifically the executable parts, without the assets. This is usually spelled out clearly as the process happens.
In terms of functional role, if we're to compare the models to open-sourced games, then all that's been open-sourced is the trivial[0] bit of code that does the inference.
Maybe a more adequate comparison would be a SoC running a Linux kernel with a big NVidia or Qualcomm binary blob in the middle of it? Sure, the Linux kernel is open source, but we wouldn't call the SoC "open source", because all that makes it what it is (software-side) is hidden in a proprietary binary.
--
[0] - In the sense that there's not much of it, and it's possible to reproduce from papers.
No, the term is fine, “source” in “open source” refers to source code. A dataset by definition is not source code. Stop changing the meaning of words.
A dataset very much is the source code. It's the part that gets turned into the program through an automated process (training is equivalent to compilation).
Academia: nowadays source code is required by a lot of conferences, but the datasets, depending on where/how they were obtained, often can't be shared or aren't available, and the exact results can't be reproduced.
Not sure if the code is required under an open source license, but it's the same issue.
---
IMO, source is source and can be used with other datasets. The dataset isn't available? Bring your own.
In this case, the source is there. The output is there, and not technically required. What isn't available is the ability to confirm the output comes from that source. That's not required under open source though.
What's disingenuous is the output being called 'open source'.
Yes its "freeware" or any one of the similar terms we've used to refer to free software.
In other words, it's everything except the one thing that actually matters.
Maybe, but it doesn't mean it's not open source.
The things that don't matter are, the thing that does isn't. Together, they can hardly be called open source.
The dataset is likely absolutely jam packed with copyrighted material that cannot be distributed.
It's not precise. People who want to use "open weights" instead of "open source" are focusing on the wrong thing.
The weights are, for all practical purposes, source code in their own right. The GPL defines "source code" as "the preferred form of the work for making modifications to it". Almost no one would be capable of reproducing them even if given the source + data. At the same time, the weights are exactly what you need for the one type of modification that's within reach of most people: fine-tuning. That they didn't release the surrounding code that produced this "source" isn't that much different than a company releasing a library but not their whole software stack.
I'd argue that "source" vs "weights" is a dangerous distraction from the far more insidious word in "open source" when used to refer to the Llama license: "open".
The Llama 3.1 license [0] specifically forbids its use by very large organizations, by militaries, and by nuclear industries. It also contains a long list of forbidden use cases. This specific list sounds very reasonable to me on its face, but having a list of specific groups of people or fields of endeavor who are banned from participating runs counter to the spirit of open source and opens up the possibility that new "open" licenses come out with different lists of forbidden uses that sound less reasonable.
To be clear, I'm totally fine with them having those terms in their license, but I'm uncomfortable with setting the precedent of embracing the word "open" for it.
Llama is "nearly-open source". That's good enough for me to be able to use it for what I want, but the word "open" is the one that should be called out. "Source" is fine.
[0] https://github.com/meta-llama/llama-models/blob/main/models/...
Do the costs really matter here? "Weights" are "the preferred form of the work for making modifications to it" in the same sense compiled binary code would be, if for some reason no one could afford to recompile a program from sources.
Fine-tuning and LoRAs and toying with the runtime are all directly equivalent to DLL injection[0], trainers[1], and various other techniques used to tweak a compiled binary before or at runtime, including plain taking a hex editor to the executable. Just because that's all anyone except the model vendor is able to do doesn't merit calling the models "open source", much like no one would call binary-only software "open source" just because reverse engineering is a thing.
No, the weights are just artifacts. The source is the dataset and the training code (and possibly the training parameters). This isn't fundamentally different from running an advanced solver for a year to find a way to make your program 100 bytes smaller so it can fit on a Tamagotchi. The resulting binary is magic and can't be reproduced without spending $$$$ on compute for the solver, but it is not open source. The source code is the bit that (produced the original binary that) went into the optimizer.
Calling these models "open source" is a runaway misuse of the term, and in some cases, a sleight of hand.
--
[0] - https://en.wikipedia.org/wiki/DLL_injection
[1] - https://en.wikipedia.org/wiki/Trainer_(games) - a type of programs popular some 20 years ago, used to cheat at, or mod, single-player games, by keeping track of and directly modifying the memory of the game process. Could be as simple as continuously resetting the ammo counter, or as complex as injecting assembly to add new UI elements.
If I understood the article correctly, he intends to let the community make suggestions to selected developers who work on the source somehow. So maybe part of the source will be made visible.
The thing is, the core of the GPT architecture is like 40 lines of code. Everyone basically knows what the source code is (minus optimizations). You just need to bring your own 20 TB of data, 100k GPUs, and tens of millions in power budget, and you too can train Llama 405B.
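For reference, here's roughly what that looks like: a bare-bones decoder-only transformer in PyTorch, with none of the production details (no RoPE, no GQA, no KV cache, no fused kernels). A sketch to show how small the core is, not Llama's actual code:

    import torch
    import torch.nn as nn

    class Block(nn.Module):
        """One pre-norm decoder block: causal self-attention + MLP."""
        def __init__(self, dim: int, n_heads: int):
            super().__init__()
            self.ln1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x):
            T = x.size(1)
            # Causal mask: True marks positions a token is NOT allowed to attend to.
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
            h = self.ln1(x)
            attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
            x = x + attn_out
            return x + self.mlp(self.ln2(x))

    class TinyGPT(nn.Module):
        """Token + position embeddings, a stack of blocks, and an output head."""
        def __init__(self, vocab: int, dim: int = 256, n_heads: int = 4, n_layers: int = 4, max_len: int = 512):
            super().__init__()
            self.tok = nn.Embedding(vocab, dim)
            self.pos = nn.Embedding(max_len, dim)
            self.blocks = nn.ModuleList(Block(dim, n_heads) for _ in range(n_layers))
            self.ln_f = nn.LayerNorm(dim)
            self.head = nn.Linear(dim, vocab, bias=False)

        def forward(self, idx):                    # idx: (batch, seq) of token ids
            pos = torch.arange(idx.size(1), device=idx.device)
            x = self.tok(idx) + self.pos(pos)
            for block in self.blocks:
                x = block(x)
            return self.head(self.ln_f(x))         # logits: (batch, seq, vocab)

    # Smoke test: next-token logits for a random batch.
    model = TinyGPT(vocab=1000)
    print(model(torch.randint(0, 1000, (2, 16))).shape)   # torch.Size([2, 16, 1000])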
No, because fine tuning is basically just a continuation of the same process that the original creators used to produce the weights in the first place, in the same way that modifying source code directly is in traditional open source. You pick up where they left off with new data and train it a little bit (or a lot!) more to adapt it to your use case.
The weights themselves are the computer program. There exists no corresponding source code. The code you're asking for corresponds not to the source code of a traditional program but to the programmers themselves and the processes used to write the code. Demanding the source code and data that produced the weights is equivalent to demanding a detailed engineering log documenting the process of building the library before you'll accept it as open source.
Just because you can't read it doesn't make it not source code. Once you have the weights, you are perfectly capable of modifying them following essentially the same processes the original authors did, which are well known and well documented in plenty of places with or without the actual source code that implements that process.
I agree wholeheartedly, but not because of "source". The sleight of hand is getting people to focus on that instead of the really problematic word.
It's not open source. Your definition would make most video games open source; we modify them all the time. The small runtime framework IS open source, but that's not much benefit, as you can't really modify it much because the weights fix it to one implementation.
No, because most video games aren't licensed in a way that makes that explicitly authorized, nor is modding the preferred form of the work for making modifications. The video game has source code that would be more useful, the model does not have source code that would be more useful than the weights.
When you require the same thing in software, namely that the whole stack needed to run the software in question be open source, we don't call the license open source.
Nope. Those model releases only open source the equivalent of "run.bat" that does some trivial things and calls into a binary blob. We wouldn't call such a program "open source".
Hell, in case of the models, "the whole stack to run the software" already is open source. Literally everything except the actual sources - the datasets and the build scripts (code doing the training) - is available openly. This is almost a literal inverse of "open source", thus shouldn't be called "open source".
Training a model is like automatic programming, and the key to it is having a well-organized dataset.
If some "open source" model release only has the weights and training methods but no dataset, it's like a repo that released an executable file with a detailed design doc. Where is the source code? Do it yourself, please.
NOTE: I understand the difficulty of open-sourcing datasets. I'm just saying that the term "open source" is getting diluted.
It’s not even free to use. There are commercial restrictions.
How do you draw/generate such an ASCII table?
In the past, I might have used a python library like asciitable to do that.
This time, I just copy pasted the raw metrics I found and asked an LLM to format it as an ASCII table.
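If you'd rather not involve an LLM, the tabulate library (a different library from asciitable, but the same idea) produces exactly this kind of grid:

    # pip install tabulate
    from tabulate import tabulate

    rows = [
        ["MMLU", 88.7, 88.6],
        ["GPQA", 53.6, 51.1],
        ["MATH", 76.6, 73.8],
        ["HumanEval", 90.2, 89.0],
        ["MGSM", 90.5, 91.6],
    ]
    print(tabulate(rows, headers=["Metric", "GPT-4o", "Llama 3.1 405B"], tablefmt="grid"))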
Don't know about OP but I generate such tables using Emacs.