
Meta AI releases Code Llama 70B

Havoc
54 replies
1d5h

Not sure who this is aimed at? The avg programmer probably doesn’t have the gear on hand to run this at the required pace

Cool nonetheless

moyix
27 replies
1d5h

You can run it on a Macbook M1/M2 with 64GB of RAM.

2OEH8eoCRo0
21 replies
1d5h

How? It's larger than 64GB.

coder543
19 replies
1d5h

Quantization is highly effective at reducing memory and storage requirements, and it barely has any impact on quality (unless you take it to the extreme). Approximately no one should ever be running the full fat fp16 models during inference of any of these LLMs. That would be incredibly inefficient.

I run 33B parameter models on my RTX 3090 (24GB VRAM) no problem. 70B should easily fit into 64GB of RAM.
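
To put numbers on that (back-of-the-envelope only; it ignores the KV cache and runtime overhead, and real GGUF files run a bit larger because some tensors stay at higher precision):

  # Rough weight-memory estimate for a dense model at different quantization levels.
  # Ignores KV cache, activations, and runtime overhead; treat it as a lower bound.
  def weight_gb(params_billion: float, bits_per_weight: float) -> float:
      return params_billion * bits_per_weight / 8

  for bits in (16, 8, 4):
      print(f"70B @ {bits}-bit: ~{weight_gb(70, bits):.0f} GB")
  # 70B @ 16-bit: ~140 GB
  # 70B @ 8-bit:  ~70 GB
  # 70B @ 4-bit:  ~35 GB, which is why a 4-bit 70B fits in 64GB of system RAM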

2OEH8eoCRo0
9 replies
1d5h

I'm aware but is it still LLaMA 70B at that point?

andy99
6 replies
1d4h

It's a legit question; the model will be worse in some way... I've seen it discussed that, all things being equal, more parameters is better (meaning it's better to take a big model and quantize it to fit in memory than to use a smaller unquantized model that fits), but a quantized model wouldn't be expected to run identically to, or as well as, the full model.

coder543
5 replies
1d4h

You don’t stop being andy99 just because you’re a little tired, do you? Being tired makes everyone a little less capable at most things. Sometimes, a lot less capable.

In traditional software, the same program compiled for 32-bit and 64-bit architectures won’t be able to handle all of the same inputs, because the 32-bit version is limited by the available address space. It’s still the same program.

If we’re not willing to declare that you are a completely separate person when you’re tired, or that 32-bit and 64-bit versions are completely different programs, then I don’t think it’s worth getting overly philosophical about quantization. A quantized model is still the same model.

The quality loss from using 4+ bit quantization is minimal, in my experience.

Yes, it has a small impact on accuracy, but with massive efficiency gains. I don’t really think anyone should be running the full models outside of research in the first place. If anything, the quantized models should be considered the “real” models, and the full fp16/fp32 model should just be considered a research artifact distinct from the model. But this philosophical rabbit hole doesn’t seem to lead anywhere interesting to me.

Various papers have shown that 4-bit quantization is a great balance. One example: https://arxiv.org/pdf/2212.09720.pdf

cjbprime
4 replies
1d3h

I don't like the metaphor: when I'm tired, I will be alert again later. Quantization is lossy compression: the human equivalent would be more like a traumatic brain injury affecting recall, especially of fine details.

The question of whether I am still me after a traumatic brain injury is philosophically unclear, and likely depends on specifics about the extent of the deficits.

coder543
3 replies
1d3h

The impact on accuracy is somewhere in the single-digit percentages at 4-bit quantization, from what I’ve been able to gather. Very small impact. To draw the analogy out further, if the model was able to get an A on a test before quantization, it would likely still get a B at worst afterwards, given a drop in the score of less than 10%. Depending on the task, the measured impact could even be negligible.

It’s far more similar to the model being perpetually tired than it is to a TBI.

You may nitpick the analogy, but analogies are never exact. You also ignore the other piece that I pointed out, which is how we treat other software that comes in multiple slightly different forms.

jameshart
1 replies
1d2h

But we’re talking about a coding LLM here. A single digit percentage reduction in accuracy means, what, one or two times in a hundred, it writes == instead of !=?

coder543
0 replies
1d2h

I think that’s too simplified. The best LLMs will still frequently make mistakes. Meta is advertising a HumanEval score of 67.8%. In a third of cases, the code generated still doesn’t satisfactorily solve the problem in that automated benchmark. The additional errors that quantization would introduce would only be a very small percentage of the overall errors, making the quantized and unquantized models practically indistinguishable to a human observer. Beyond that, lower accuracy can manifest in many ways, and “do the opposite” seems unlikely to be the most common way. There might be a dozen correct ways to solve a problem. The quantized model might choose a different path that still turns out to work, it’s just not exactly the same path.

As someone else pointed out, FLAC is objectively more accurate than mp3, but how many people can really tell? Is it worth 3x the data to store/stream music in FLAC?

The quantized model would run at probably 4x the speed of the unquantized model, assuming you had enough memory to choose between them. Is speed worth nothing? If I have to wait all day for the LLM to respond, I can probably do the work faster myself without its help. Is being able to fit the model onto the hardware you have worth nothing?

In essence, quantization here is a 95% “accurate” implementation of a 67% accurate model, which yields a 300% increase in speed while using just 25% of the RAM. All numbers are approximate, even the HumanEval benchmark should be taken with a large grain of salt.
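
Spelling that trade out with the rough numbers above, where the 95% retention figure is an assumption for illustration rather than a measurement:

  # Back-of-the-envelope trade-off using the approximate figures from this thread.
  humaneval_fp16 = 0.678   # advertised pass rate of the full-precision model
  quant_fidelity = 0.95    # assumed fraction of that accuracy retained at 4-bit
  speedup = 4.0            # ~4x faster inference
  ram_fraction = 0.25      # ~1/4 of the memory

  effective_pass = humaneval_fp16 * quant_fidelity
  print(f"quantized pass rate: ~{effective_pass:.1%} (vs {humaneval_fp16:.1%})")
  print(f"at {speedup:.0f}x the speed and {ram_fraction:.0%} of the RAM")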

If you have a very opulent computational experience, you can enjoy the luxury of the full 67.8% accurate model, but that just feels both wasteful and like a bad user experience.

berniedurfee
0 replies
1d2h

Reminds me of the never ending MP3 vs FLAC argument.

The difference can be measured empirically, but is it noticeable in real-world usage?

manmal
0 replies
1d4h

Sure, quantization reduces information stored for each parameter, not the parameter count.

coder543
0 replies
1d5h

Yes. Quantization does not reduce the number of parameters. It does not re-train the model.

sbrother
8 replies
1d4h

Can I ask how many tok/s you're getting on that setup? I'm trying to decide whether to invest in a high-end NVIDIA setup or a Mac Studio with llama.cpp for the purposes of running LLMs like this one locally.

coder543
5 replies
1d4h

On a 33B model at q4_0 quantization, I’m seeing about 36 tokens/s on the RTX 3090 with all layers offloaded to the GPU.

Mixtral runs at about 43 tokens/s at q3_K_S with all layers offloaded. I normally avoid going below 4-bit quantization, but Mixtral doesn’t seem fazed. I’m not sure if the MoE just makes it more resilient to quantization, or what the deal is. If I run it at q4_0, then it runs at about 24 tokens/s, with 26 out of 33 layers offloaded, which is still perfectly usable, but I don’t usually see the need with Mixtral.

Ollama dynamically adjusts the layers offloaded based on the model and context size, so if I need to run with a larger context window, that reduces the number of layers that will fit on the GPU and that impacts performance, but things generally work well.
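
If anyone wants to reproduce these numbers, Ollama's local HTTP API reports token counts and timings per request. A minimal sketch, assuming the endpoint and field names exposed by recent Ollama versions (worth double-checking against your install):

  # Ask a local Ollama server for a completion and compute tokens/s from the
  # timing fields in the response. Assumes Ollama is running on its default port
  # and the model has already been pulled (e.g. `ollama pull codellama:70b`).
  import json
  import urllib.request

  payload = {
      "model": "codellama:70b",
      "prompt": "Write a Python function that reverses a linked list.",
      "stream": False,
  }
  req = urllib.request.Request(
      "http://localhost:11434/api/generate",
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      result = json.load(resp)

  tokens = result["eval_count"]            # generated tokens
  seconds = result["eval_duration"] / 1e9  # reported in nanoseconds
  print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tok/s")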

sorenjan
3 replies
1d2h

What's the power consumption and fan noise like when doing that? I assume you're running the model doing inference in the background for the whole coding session, i.e. hours at a time?

coder543
2 replies
1d2h

I don’t use local LLMs for CoPilot-like functionality, but I have toyed with the concept.

There are a few things to keep in mind: no programmer that I know is sitting there typing code for hours at a time without stopping. There’s a lot more to being a developer than just typing, whether it is debugging, thinking, JIRA, Slack, or whatever else. These CoPilot-like tools will only activate after you type something, then stop for a defined timeout period. While you’re typing, they do nothing. After they generate, they do nothing.

I would honestly be surprised if the GPU active time was more than 10% averaged over an hour. When actively working on a large LLM, the RTX 3090 is drawing close to 400W in my desktop. At a 10% duty cycle (active time), that would be 40W on average, which would be 320Wh over the course of a full 8-hour day of crazy productivity. My electric rate is about 15¢/kWh, so that would be about 5¢ per day. It is absolutely not running at a 100% duty cycle, and it’s absurd to even do the math for that, but we can multiply by 10 and say that if you’re somehow a mythical “10x developer” then it would be 50¢/day in electricity here. I think 5¢/day to 10¢/day is closer to reality. Either way, the cost is marginal at the scale of a software developer’s salary.
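
The arithmetic, in case you want to plug in your own wattage, duty cycle, and electric rate:

  # Rough daily electricity cost of GPU-assisted code completion, using the
  # approximate figures from the comment above.
  gpu_watts = 400        # RTX 3090 while actively generating
  duty_cycle = 0.10      # fraction of the workday the GPU is actually busy
  hours_per_day = 8
  usd_per_kwh = 0.15

  kwh_per_day = gpu_watts * duty_cycle * hours_per_day / 1000
  print(f"{kwh_per_day:.2f} kWh/day -> ${kwh_per_day * usd_per_kwh:.3f}/day")
  # 0.32 kWh/day -> $0.048/day, i.e. roughly a nickel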

sorenjan
1 replies
1d2h

That sounds perfectly reasonable. I'm more worried about noise and heat than the cost though, but I guess that's not too bad either then. What's the latency like? When I've used generative image models the programs unload the model after they're done, so it takes a while to generate the next image. Is the model sitting in VRAM when it's idle?

coder543
0 replies
1d2h

Fan noise isn’t very much, and you can always limit the max clockspeeds on a GPU (and/or undervolt it) to be quieter and more efficient at a cost of a small amount of performance. The RTX 3090 still seems to be faster than the M3 Max for LLMs that fit on the 3090, so giving up a little performance for near-silent operation wouldn’t be a big loss.

Ollama caches the last used model in memory for a few minutes, then unloads it if it hasn’t been used in that time to free up VRAM. I think they’re working on making this period configurable.

Latency is very good in my experience, but I haven’t used the local code completion stuff much, just a few quick experiments on personal projects, so my experience with that aspect is limited. If I ever have a job that encourages me to use my own LLM server, I would certainly consider using it more for that.

sbrother
0 replies
1d4h

Thanks! That is really fast for personal use.

nullstyle
0 replies
1d4h

Here's an example of megadolphin running on my m2 ultra setup: https://gist.github.com/nullstyle/a9b68991128fd4be84ffe8435f...

int_19h
0 replies
1d

I run LLaMA 70B and 120B (frankenmerges) locally on a 2022 Mac Studio with an M1 Ultra and 128GB of RAM. It gives ~7 tok/s for 120B and ~9.5 tok/s for 70B.

Note that the M1/M2 Ultra is quite a bit faster than the M3 Max, mostly due to 800 GB/s vs 400 GB/s memory bandwidth.

rgbrgb
0 replies
1d5h

Quantization can take it under 30GB (with quality degradation).

For example, take a look at the GGUF file sizes here: https://huggingface.co/TheBloke/Llama-2-70B-GGUF

reddit_clone
4 replies
1d4h

I am not too familiar with LLMs and GPUs (Not a gamer either). But want to learn.

Could you please expand on what else would be capable of running such models locally?

How about a linux laptop/desktop with specific hardware configuration?

MeImCounting
2 replies
1d4h

It pretty much comes down to two factors: memory bandwidth and compute. You need high enough memory bandwidth to be able to "feed" the compute, and you need beefy enough compute to keep up with the data being fed in by the memory. In theory a single Nvidia 4090 would be able to run a 70B model with quantization at "useable" speeds. The reason Mac hardware is so capable in AI is the unified memory architecture, meaning the memory is shared between the GPU and CPU. There are other factors, but it essentially comes down to tokens-per-second advantages. You could run one of these models on an old GPU with low memory bandwidth just fine, but your tokens per second would be far too slow for what most people consider "useable", and the quantization necessary might start noticeably affecting the quality.
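
A useful rule of thumb for the bandwidth side (a rough model, not a benchmark): single-stream generation has to read essentially all of the weights once per token, so memory bandwidth divided by model size gives a ceiling on tokens per second.

  # Crude ceiling on single-batch generation speed: every token streams the full
  # set of weights from memory once. Real throughput lands below this, and the
  # weights obviously have to fit in that memory pool in the first place.
  def max_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
      return bandwidth_gb_s / model_gb

  model_gb = 35  # ~70B at 4-bit
  for name, bw in [("M1/M2 Ultra (~800 GB/s)", 800),
                   ("RTX 3090 (~936 GB/s)", 936),
                   ("dual-channel DDR4 (~50 GB/s)", 50)]:
      print(f"{name}: <= {max_tok_per_s(bw, model_gb):.0f} tok/s")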

int_19h
1 replies
1d1h

A single RTX 4090 can run at most 34b models with 4-bit quantization. You'd need 2-bit for 70b, and at that point quality plummets.

Compute is actually not that big of a deal once generation is ongoing, compared to memory bandwidth. But the initial prompt processing can easily be an order of magnitude slower on CPU, so for large prompts (which would be the case for code completion), acceleration is necessary.

MeImCounting
0 replies
23h44m

That's a good point.

For example, both the RTX 4090 and the RTX 6000 Ada Generation use the AD102 chip. The RTX 6000 Ada, though, would be able to run 70B models thanks to its larger memory pool, despite having the same memory interface width.

tinco
0 replies
23h22m

There are no other laptops on the market with as much VRAM as the 64GB MacBook Pros have, as far as I know. You could build a Linux desktop with two 3090s linked together, giving 48GB of VRAM, which apparently can run a 4-bit quantized, 6k-context 70B Llama model.

People are recommending MacBooks because they're a relatively cheap and easy way to get a very large amount of RAM hooked up to your accelerator.

Note that these are quantized versions of the model, so they're not as good as the original 70B model, though people claim their performance is really close to the original. To run without quantization you'd need about 140GB of VRAM, which would only be possible with an NVidia H100 (don't know the price) or two A100s (at $18,000 each).

svara
10 replies
1d5h

It's aimed at OpenAI's moat. Making sure they don't accumulate too much of one. No one actually has to use this, it just needs to be clear that LLM as a service won't be super high margin because competition can simply start building on Meta's open source releases.

adventured
5 replies
1d4h

The moat is all but guaranteed to be the scale of the GPUs required to operate these for a lot of users as they get ever larger, specifically the extreme cost that is going along with that.

Anybody have $10 billion sitting around to deploy that gigantic open source set-up for millions of users? There's your moat, and only a relative few companies will be able to do it.

One of Google's moats is, has been, and will always be the scale required to just get into the search game and the tens of billions of dollars you need to compete in search effectively (and that's before you get to competing with their brand). Microsoft has spent over a hundred billion dollars trying to compete with Google, and there's little evidence anybody else has done better anywhere (Western Europe hasn't done anything in search, there's Baidu out of China, and Yandex out of Russia).

VRAM isn't moving nearly as fast as the models are progressing in size. And it's never going to. The cost will get ever greater to operate these at scale.

Unless someone sees a huge paradigm change for cheaper, consumer accessible GPUs in the near future (Intel? AMD? China?). As it is, Nvidia owns the market and they're part of the moat cost problem.

staticman2
0 replies
1d3h

The moat is all but guaranteed to be the scale of the GPUs required to operate these

You don't have to run them locally.

dragonwriter
0 replies
1d3h

VRAM isn't moving nearly as fast as the models are progressing in size.

Models of any given quality are declining in size (both number of parameters, and also VRAM required for inference per parameter because quantization methods are improving.)

brucethemoose2
0 replies
1d3h

VRAM isn't moving nearly as fast as the models are progressing in size. And it's never going to. The cost will get ever greater to operate these at scale.

It will be, at least by 2025. AMD (and maybe Intel) will have M-Pro-esque APUs that can run a 70B model at very reasonable speeds.

I am pretty sure Intel is going to rock the VRAM boat on desktops as well. They literally have no market to lose, unlike AMD which infuriatingly still artificially segments their high VRAM cards.

alfalfasprout
0 replies
1d3h

and this is why the LLM arms race for ultra high parameter count models will stagnate. It's all well and good that we're developing interesting new models. But once you factor cost into the equation it does severely limit what applications justify the cost.

Raw FLOPs may increase each generation but VRAM becomes a limiting factor. And fast VRAM is expensive.

I do expect to see incremental innovation in reducing the size of foundational models.

KaiserPro
0 replies
1d2h

The moat is all but guaranteed to be the scale of the GPUs required to operate these for a lot of users

for end users, yes. For small companies that want to finetune, evaluate and create derivatives, it reduces the cost by millions.

FrustratedMonky
3 replies
1d4h

So, strange as it seems, is Meta being more 'open' than OpenAI, which was created to be the 'open' option to fight off Meta and Google?

majani
0 replies
1h7m

Sometimes you arrive at your intended solution in a roundabout way

holoduke
0 replies
1d4h

Meta is becoming the good guy. It's actually a smart move; some extra reputation points won't hurt Meta.

chasd00
0 replies
1d3h

If Meta can turn the money making sauce in GenAI from model+data to just data then it's in a very good position. Meta has tons of data.

blackoil
6 replies
1d5h

How feasible would it be to fine-tune this on internal code and have an enterprise copilot?

keriati1
3 replies
1d5h

We already run an in-house Ollama server prototype for coding assistance with DeepSeek Coder, and it is pretty good. Now if we could get a model for this that is at GPT-4's level, I would be super happy.

eurekin
2 replies
1d5h

Did you finetune a model?

keriati1
1 replies
1d4h

No, we went with a RAG pipeline approach, as we assume things change too fast.
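
For anyone unfamiliar with the term, a RAG pipeline here just means retrieving relevant chunks of the codebase and stuffing them into the prompt at query time instead of baking them into the weights. A deliberately naive sketch of the idea (keyword overlap instead of embeddings; this is not keriati1's actual setup):

  # Toy retrieval-augmented-generation helper: chunk source files, rank chunks by
  # crude keyword overlap with the query, and pack the best ones into a prompt.
  # Purely illustrative; real pipelines use embeddings and smarter chunking.
  from pathlib import Path

  def chunk(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
      step = size - overlap
      return [text[i:i + size] for i in range(0, max(len(text), 1), step)]

  def score(query: str, chunk_text: str) -> int:
      query_words = set(query.lower().split())
      return len(query_words & set(chunk_text.lower().split()))

  def build_prompt(query: str, repo_dir: str, budget_chars: int = 6000) -> str:
      chunks: list[str] = []
      for path in Path(repo_dir).rglob("*.py"):
          chunks += chunk(path.read_text(errors="ignore"))
      ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
      context, used = [], 0
      for c in ranked:
          if used + len(c) > budget_chars:
              break
          context.append(c)
          used += len(c)
      return "Relevant code:\n" + "\n---\n".join(context) + f"\n\nQuestion: {query}"

Whatever build_prompt() returns is what you'd send to the locally hosted model.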

eurekin
0 replies
1d4h

Thanks! Any details how you chunk and find the relevant code?

Or how you deal with context length? I.e. do you send anything other than the current file? How is the prompt constructed?

lozenge
0 replies
1d5h

Considering a number of SaaS products offer this service, I'd say it's feasible.

jejeyyy77
0 replies
1d4h

already been done

kungfupawnda
2 replies
1d3h

I got it to build and run the example app on my M3 Max with 36 GB of RAM. Memory pressure was around 32 GB.

dimask
1 replies
1d1h

Did you quantise it? If so, at what level, and what was your impression compared to other recent smaller models at that quantisation?

kungfupawnda
0 replies
1d1h

No, I just ran it out of the box, but I had to modify the source code to run on a Mac.

Instructions here: https://github.com/facebookresearch/llama/pull/947/

ttul
0 replies
1d5h

Yeah, but if your company wants to rent an H100, you can deploy this for your developers for much less than the cost of a developer…

oceanplexian
0 replies
21h40m

I have a multi-GPU rig designed exactly for this purpose :) Check out r/localllama. There are literally dozens of us!

dimask
0 replies
1d1h

There are companies like Phind that offer Copilot-like services using fine-tuned versions of CodeLlama-34B, which IMO are actually good. But I don't know whether a model as large as this one is going to be used in that context.

connorgutman
0 replies
1d5h

This is targeted towards GPU rental services like RunPod as well as API providers such as Together AI. Together.ai is charging $0.90/1M tokens for 70B parameters. https://www.together.ai/pricing

Spivak
0 replies
1d5h

People who want to host the models, presumably. AWS Bedrock will definitely include it.

bk146
43 replies
1d5h

Can someone explain Meta's strategy with the open source models here? Genuine question, I don't fully understand.

(Please don't say "commoditize your complement" without explaining what exactly they're commoditizing...)

eurekin
11 replies
1d5h

Total speculation: Yann LeCun is there and he is really passionate about the technology and openness

skottenborg
6 replies
1d4h

I doubt personal passions would merit the company funding required for such big models.

eurekin
5 replies
1d4h

Given how megacorps spend millions on a whim (Disney with all its recent flops), or when just a single person wants it (MS Flight Simulator?), I wouldn't be surprised, to be honest...

But sure, that sounds more reasonable.

og_kalu
4 replies
1d4h

Disney didn't spend millions on a whim. It's just the reality of box office that even millions in investment are no guarantee for returns.

eurekin
3 replies
17h4m

Financially, they have underperformed significantly over a longer period of time (10 years):

  For shareholders, this subpar performance has destroyed value. Disney stock has underperformed the stocks
  of Disney’s self-selected proxy peers and the broader market over every relevant period during the last
  decade and during the tenure of each non-management director. Furthermore, it has underperformed since
  Bob Iger was first appointed CEO in 2005 – a period during which he has served as CEO or Executive
  Chairman (directing the Company’s creative endeavors in this role) for all but 11 months. Disney shareholders
  were once over $200 billion wealthier than they are now
Which is radically different from the previous 90 years.

https://trianpartners.com/wp-content/uploads/2023/12/Trian-N...

og_kalu
2 replies
7h27m

Share price isn't the be all end all.

Disney has steamrolled Hollywood for the last decade, bringing in by far the biggest global box office revenue in 7 of the last 8 years. They have more billion-dollar box office movies than every other studio combined. This kind of dominance was unheard of in the history of Hollywood.

Setting box office aside, Disney's revenue has tripled since Iger took over, and is still roughly double even after adjusting for inflation.

The idea that the company has underperformed for the last 10 years, or that they spend millions "on a whim", is a joke. And using share price as justification is even more absurd; the share price was double what it is today as recently as 2021.

eurekin
1 replies
4h4m

The idea that the company has underperformed for the last 10 years ... is a joke

Did you even read the Trian Partners quote from their letter? It's their words, not mine.

og_kalu
0 replies
1h42m

There is no quote that says that.

"Earnings per share (“EPS”) in the most recent fiscal year were lower than the EPS generated by Disney a decade ago"

is not the same as "underperforming for a decade".

All that says is that EPS is currently low, not that it has been low and reducing/stagnant for a decade.

observationist
3 replies
1d5h

The faux-open models mean the models can't be used in competing products. The open code base means enthusiasts and amateurs and other people hack on Meta projects and contribute improvements.

They get free R&D and suppress competition, while looking like they have principles. Yann is clueless about open source principles, or the models would have been Apache or some other comparably open license. It's all ruthless corporate strategy, regardless of the mouth noises coming out of various meta employees.

sangnoir
1 replies
1d2h

The faux-open models mean the models can't be used in competing products.

Just because certain entities can't profitably use a product or obtain a license doesn't make it not-open. AGPL is open, for an extreme example.

This argument is also subjective, and not new - "Which is more open, BSD-style licenses or the GPL?" has been a guaranteed flamewar starter for decades.

observationist
0 replies
23h49m

I'm not arguing about BSD or GPL. I'm saying that the "open source code, proprietary binary blob" pattern Meta is running with is about quashing potential competition, market positioning, and corporate priorities over any tangential beneficial contributions to open source AI.

It's shitty when other companies do it. It's shitty when Broadcom does it. It's shitty when Meta does it.

It's never a not shitty thing to do.

importantbrian
0 replies
1d4h

Meta's choice of license doesn't indicate that Yann is clueless about open-source principles. I don't know about Meta specifically, but in most companies choosing the license for open source projects involves working with a lot of different stakeholders. He very easily could have pushed for Apache or MIT and some other interest group within Meta vetoed it.

pchristensen
5 replies
1d5h

Meta doesn't have an AI "product" competing with OpenAI, Google's Bard, etc. But they use AI extensively internally. This is roughly a byproduct of their internal AI work that they're already doing, and fostering open source AI development puts incredible pressure on the AI products and their owners.

If Meta can help prevent there from being an AI monopoly company, but rather an ecosystem of comparable products, then they avoid having another threatening tech giant competitor, as well as preventing their own AI work and products from being devalued.

Think of it like Google releasing a web browser.

IshKebab
4 replies
1d4h

Google releasing a (very popular) web browser gives them direct control of web standards. What does this give Facebook?

patapong
1 replies
1d4h

I think we should not underestimate the strategic talent acquisition value as well. Many top-tier AI engineers may appreciate the openness and choose to join meta, which could be very valuable in the long run.

jwkane
0 replies
1d4h

Excellent point -- goodwill in a hyper-high demand dev community is invaluable.

fngjdflmdflg
0 replies
1d1h

Web standards are probably the last thing Google cares about with Chrome. Much more important is being the default search engine and making sure data collection isn't interrupted by a potential privacy minded browser.

eganist
0 replies
1d4h

OP already mentioned that it adds additional hurdles for possible future tech giants to have to cross on their quest.

It's akin to a Great Filter, if such an analogy helps. If Meta's open models make a company's closed models uneconomical for others to consume, then the business case for those models is compromised and the odds of them growing to a size where they can compete with Meta in other ways is mitigated a bit.

crowcroft
3 replies
1d4h

Meta's end goal is to have better AI than everyone else; in the medium term that means they want to have the best foundational models. How does this help?

1. They become an attractive place for AI researchers to work, and can bring in better staff.

2. They make it less appealing for startups to enter the space and build large foundation models (Meta would prefer 1,000 startups pop up and play around with other people's models than 1,000 startups popping up and trying to build better foundational models).

3. They put cost pressure on AI-as-a-service providers. When Llama exists, it's harder for companies to make a profit just selling access to models. Along with 2, this further limits the possibility of startups entering the foundational model space, because the path to monetization/breakeven is more difficult.

Essentially this puts Meta, Google, and OpenAI/Microsoft (Anthropic/Amazon as a number four maybe) as the only real players in the cutting edge foundational model space. Worst case scenario they maintain their place in the current tech hegemony as newcomers are blocked from competing.

siquick
2 replies
1d2h

Essentially this puts Meta, Google, and OpenAI/Microsoft (Anthropic/Amazon as a number four maybe) as the only real players in the cutting edge foundational model space.

Mistral is right up there.

yodsanklai
0 replies
1d1h

Mistral has ~20 employees. I'm sure they have good researchers, but don't they lack the computing and engineering resources the big actors have?

crowcroft
0 replies
1d1h

I'm curious to see how they go, I might have a limited understanding. From what I can tell they do a good job in terms of value and efficiency with 'lighter' models, but I don't put them in the same category as the others in the sense that they aren't producing the massive absolute best in class LLMs.

Hopefully they can prove me wrong though!

gen220
2 replies
1d5h

They're commoditizing the ability to generate viral content, which is the carrot that keeps peoples' eyeballs on the hedonic treadmill. More eyeball-time = more ad placements = more money.

On the advertiser side, they're commoditizing the ability for companies to write more persuasively-targeted ads. Higher click-through rates = more money.

[edit]: For models that generate code instead of content (TFA), it's obviously a different story. I don't have a good grip on that story, beyond "they're using their otherwise-idle GPU farms to buy goodwill and innovate on training methods".

esafak
1 replies
1d3h

That stuff ultimately drives people away. Who thinks "I need my daily fix of genAI memes, let me head to Facebook!"?

satvikpendem
0 replies
13h25m

People on HN are not representative of the average Facebook user.

emporas
2 replies
1d4h

Yann LeCun has talked about Meta's strategy with open source. The general idea is that the smartest people in the world do not work for you, and no company can replicate the innovation that comes from open source internally.

yodsanklai
1 replies
1d1h

The general idea, is that the smartest people in the world do not work for you

Most likely, they work for your competitors. They may not be working to improve your system for free.

No company can replicate innovation from open source internally.

Lot of innovation does come from companies.

emporas
0 replies
1d

Lot of innovation does come from companies.

Of course, I am not arguing that. But when it comes to software as general as code generation or text generation, the possible applications are so broad that a team of AI researchers in a company, however talented and productive they are, cannot possibly optimize it for every possible use case.

That's what Yann LeCun is referring to, and I agree with him. There are a lot of companies that push deep learning forward and do not release their code or weights freely.

Philpax
1 replies
1d5h

Aside from the "positive" explanations offered in the sibling comments, there's also a "negative" one: other AI companies that try to enter the fray will not be able to compete with Meta's open offerings. After all, why would you pay a company to undertake R&D on building their own models when you can just finetune a Llama?

two_in_one
0 replies
23h20m

Whatever Meta's motivation is, they help diversify the supply of models, which is a good thing: it avoids lock-in. As usual, reality is more complicated, with many moving parts. Free models may undercut small startups, but at the same time they stimulate a secondary market of providers and tuners.

theGnuMe
0 replies
1d4h

Bill Gurley has a good perspective on it.

Essentially, you mitigate IP claims and reduce vendor dependency.

https://eightify.app/summary/technology-and-software/the-imp...

simonw
0 replies
1d5h

AI seems like the Next Big Thing. Meta have put themselves at the center of the most exciting growth area in technology by releasing models they have trained.

They've gained an incredible amount of influence and mindshare.

pyinstallwoes
0 replies
1d5h

To be crowned the harbinger of AGI.

flir
0 replies
1d4h

Really enjoying how many different answers you got.

(My theory: if there's an AI pot of gold, what megacorp can risk one of the others getting to it first?)

datadrivenangel
0 replies
1d5h

AI puts pressure on search, cutting into Google's ad revenue. Meta's properties are less vulnerable to that kind of pressure from AI.

chasd00
0 replies
1d3h

My opinion is that Meta is taking the model out of the secret-sauce formula. That leaves hardware and data for training as the barriers to entry. If you don't need to develop your own model, then all you need is data and hardware, which lowers the barrier to entry. The lower the barrier, the more GenAI startups, and the more potential data customers for Meta, since they certainly have large, curated datasets for sale.

bryan_w
0 replies
1d5h

Part of it is that they already had this developed for years (see alt text on uploaded images for example), and they want to ensure that new regulations don't hamper any of their future plans.

It costs them nothing to open it up, so why not. Kinda like all the rest of their GitHub repos.

blitzar
0 replies
12h38m

Facebook went all in on the metaverse and turned into Meta; quite rightly, the market looked at what they produced for tens of billions and decided their company was worthless.

Then AI sprang to the front pages, and any CEO who stood up and said "AI" was rewarded with a 10x stock price. The unloved stepchild that was the ML team became the A team, and the metaverse team have been sent to the naughty step. Facebook/Meta have no actual customer-facing use for AI, unlike Microsoft/Google/GitHub, but they like a good stonk price rise, so what we see is their strategy to stay in the AI game and stay relevant.

It turns out it is pretty good for the rest of us (possibly the first time Facebook has given something positive to humanity), as we get shiny toys to play with.

apples_oranges
0 replies
1d5h

OT: You „don‘t“ or you „don’t fully“ understand? ;)

(I try to train myself to say it right ..)

andy99
0 replies
1d5h

I think a big part of it is just because they have a big AI lab. I don't know the genesis of that, but it has for years been a big contributor, see pytorch, models like SEER, as well as being one of the dominant publishers at big conferences.

Maybe now their leadership wants to push for practicality so they don't end up like Google (also a research powerhouse but failing to convert to popular advances) so they are publicly pushing strong LLMs.

Too
0 replies
1d4h

Meta still sit on all the juicy user data that they want to use AI on but they don’t know how. They are crowdsourcing development of applications and tooling.

Meta releases model. Joe builds a cool app with it, earns some internet points and if lucky a few hundred bucks. Meta copies app, multiply Joes success story with 1 billion users and earn a few million bucks.

Joe is happy, Meta is happy. Everybody is happy.

Lerc
0 replies
1d5h

If they hadn't opened the models the llama series would just be a few sub-GPT4 models. Opening the models has created a wealth of development that has built upon those models.

Alone, it was unlikely they would become a major player in a field that might be massively important. With a large community building upon their base they have a chance to influence the direction of development and possibly prevent a proprietary monopoly in the hands of another company.

Calvin02
0 replies
1d4h

Controversial take:

Meta sees this as the way to improve their AI offerings faster than others and, eventually, better than others.

Instead of a small group of engineers working on this inside Meta, the Open Source community helps improve it.

They have a history of this with React, PyTorch, hhvm, etc. All these have gotten better as OS projects faster than Meta alone would have been able to do.

colesantiago
40 replies
1d6h

Llama is getting better and better; I've heard this and Llama 3 will start to be as good as GPT-4.

Who would have thought that Meta, which has been chucking billions at the metaverse, is at the forefront of open source AI.

Not to mention their stock is up and they are worth $1TN, again.

Not sure how I feel about this given all the scandals that have plagued them: the massive 1BN fine from the EU, Cambridge Analytica, and, last of all, causing a genocide in Myanmar.

Goes to show that nobody cares about all of these scandals and just moves on into the future, allowing Facebook to keep collecting all this data for their models.

If any other startup or mid sized company had at least two of these large scandals, they would be dead in the water.

austinpena
30 replies
1d6h

I'm really curious what their goal is

Arainach
7 replies
1d5h

Disclaimer: I do not work at Meta, but I work at a large tech company which competes with them. I don't work in AI, although if my VP asks don't tell them I said that or they might lay me off.

Multiple of their major competitors/other large tech companies are trying to monetize LLMs. OpenAI maneuvering an early lead into a dominant position would make it another potential major competitor. If releasing these models slows or hurts them, that is in and of itself a benefit.

michaelt
6 replies
1d5h

Why?

What benefit is there to grabbing market share from your competitors... in a business you don't even want to be in?

By that logic you could justify any bizarre business decision. Should Google launch a social network, to hurt their competitor Facebook? Should Facebook, Amazon and Microsoft each launch a phone?

Filligree
3 replies
1d5h

Should Google launch a social network, to hurt their competitor Facebook?

I mean, Google did launch a social network, to hurt their competitor Facebook. It was a whole thing. It was even a really nice system, eventually.

wanderingstan
1 replies
1d5h

And it turned out that Facebook had quite a moat with network effects. OpenAI doesn’t have such a moat, which may be what Meta is wanting to expose.

esafak
0 replies
1d1h

Google botched the launch, and they never nurture products after launch anyway. Google+ could have been more successful.

zaat
0 replies
1d3h

I enjoyed using Google Plus more than any other social network, and managed to create new connections and/or have standard, authentic, real conversations with people I didn't know. Most of them were ordinary people with shared interests that I probably wouldn't have met otherwise; some of them were people I can't believe I could connect with directly in any other way - newspaper and news site editors, major SDK developers. And even with Kevin Kelly.

Arainach
0 replies
1d4h

Who says they don't want to be in the market? Facebook has one product. Their income is entirely determined by ads on social media. That's a perilous position subject to being disrupted. Meta desperately wants to diversify its product offerings - that's why they've been throwing so much at VR.

dsabanin
2 replies
1d5h

I think Meta's goal is to subvert Google, MS and OpenAI, after realizing it's not positioned well to compete with them commercially.

api
1 replies
1d5h

Could also be that these smaller models are a loss leader or advertisement for a future product or service... like a big brother to Llama3 that's commercial.

idkyall
0 replies
1d1h

I believe there were rumors they are developing a commercial model: e.g. https://www.ft.com/content/01fd640e-0c6b-4542-b82b-20afb203f...

jedberg
1 replies
1d5h

Commoditizing your complement. If all your competitors need a key technology to get ahead, you make a cheap/free version of it so that they can't use it as a competitive advantage.

nindalf
0 replies
1d5h

The complement being the metaverse. You can’t handcraft the metaverse because it would be infeasible. If LLMs are a commodity that everyone has access to, then it can be done on the cheap.

Put another way - if OpenAI were the only game in town, how much would they be charging for their product? They’re competing on price because competitors exist. Now imagine the price if a hypothetical high-quality open source model existed that customers can use for “free”.

That’s the future Meta wants. They weren’t getting rich selling shovels like cloud providers are, they want everyone digging. And everyone digs when the shovels are free.

hendersoon
1 replies
1d5h

Their goals are clear, dominance and stockholder value. What I'm curious about is how they plan to monetize it.

refulgentis
0 replies
1d5h

Use it in products, e.g. the chatbots.

elorant
1 replies
1d5h

Prevent OpenAI from dominating the market, and at the same time have the research community enhance your models and identify key use cases.

pennomi
0 replies
1d5h

Meta basically got a ton of free R&D that directly applies to their model architecture. Their next generation AIs will always benefit from the techniques/processes developed by the clever researchers and hobbyists out there.

Ruq
1 replies
1d6h

Zuck just wants another robot to talk to.

throwup238
0 replies
1d5h

Poor guy just wants a friend that won't sell him out to the FTC or some movie producers.

wrsh07
0 replies
1d5h

Rule 5: commodify your complement

Content generation is complementary to most of meta's apps and projects

ttul
0 replies
1d5h

If you want to employ the top ML researchers, you have to give them what they want, which is often the ability to share their discoveries with the world. Making Llama-N open may not be Zuckerberg‘s preference. It’s possible the researchers demanded it.

tsunamifury
0 replies
1d5h

Their goal is to counter the competition. You should rarely pick the exact same strategy as your competitor and count on outgunning them; rather, you should counter them. OpenAI is ironically closed, so Meta will be open. If you can't beat them, you should try to degrade the competitor's value case.

It's a smart move IMO.

regimeld
0 replies
1d5h

Commoditize your complements.

refulgentis
0 replies
1d5h

Same as MS, in the game, in the conversation, and ensuring next-gen search margins approximate 0.

ppsreejith
0 replies
1d5h

If you go by what Zuck says, he calls this out in previous earnings reports and interviews[1]. It mainly boils down to 2 things:

1. Similar to other initiatives (mainly Open Compute but also PyTorch, React, etc.), community improvements help them improve their own infra and help attract talent.

2. Helping people create better content ultimately improves the quality of content on their platforms (both FoA & RL).

Sources:

[1]Interview with verge: https://www.theverge.com/23889057/mark-zuckerberg-meta-ai-el... . Search for "regulatory capture right now with AI"

Zuck: ... And we believe that it’s generally positive to open-source a lot of our infrastructure for a few reasons. One is that we don’t have a cloud business, right? So it’s not like we’re selling access to the infrastructure, so giving it away is fine. And then, when we do give it away, we generally benefit from innovation from the ecosystem, and when other people adopt the stuff, it increases volume and drives down prices.

Interviewer: Like PyTorch, for example?

Zuck: When I was talking about driving down prices, I was thinking about stuff like Open Compute, where we open-sourced our server designs, and now the factories that are making those kinds of servers can generate way more of them because other companies like Amazon and others are ordering the same designs, that drives down the price for everyone, which is good.
megaman821
0 replies
1d5h

They were going to make most of this anyway for Instagram filters, chat stickers, internal coding tools, VR world generation, content moderation, etc. Might as well do a little extra work to open source it, since it doesn't really compete with anything Meta is selling.

blackoil
0 replies
1d5h

Devil's advocate: they have to build it anyway for the metaverse and in general. Management has no interest in going into the cloud business. They had Parse a long time back, but that is done. So why not release it? They are getting goodwill/mindshare, may set an industry standard, and get community benefit. It isn't very different from React, Torch, etc.

akozak
0 replies
1d5h

I would guess mindshare in a crowded field, ie discussion threads just like this one that help with recruiting and tech reputation after a bummer ~8 years. (It's best not to overestimate the complexity/# of layers in a bigco's strategic thinking.)

aantix
0 replies
1d5h

To undermine the momentum of OpenAI.

If Meta were at the forefront, these models would not be openly available.

They are scrambling.

PheonixPharts
0 replies
1d5h

I imagine their goal is to simultaneously show that Meta is still SotA when it comes to AI and at the same time feed a community of people who will work for free to essentially undermine OpenAI's competitive advantage and make life worse for Google since at the very least LLMs tend to be a better search engine for most topics.

There's far more risk if Meta were to try to directly compete with OpenAI and Microsoft on this. They'd have to manage the infra, work to acquire customers, etc, etc on top of building these massive models. If it's not a space they really want to be in, it's a space they can easily disrupt.

Meta's late game realization was that Google owned the web via search and Apple took over a lot of the mobile space with their walled garden. I suspect Meta's view now is that it's much easier to just prevent something like this from happening with AI early on.

fragmede
5 replies
1d5h

Model available, not open source. These models aren't open source because we don't have access to the datasets, nor the full code to train them, so we couldn't recreate the models even if we had the GPU time available.

colesantiago
4 replies
1d5h

Everyone using AI in production is using Pytorch by Meta.

Which is open source.

I do not know anybody important in the AI space apart from Google using TensorFlow.

Philpax
3 replies
1d5h

That may be true, but it's largely irrelevant. The ML framework in use has no bearing on whether or not you have the data required to reproduce the model being trained with that framework.

colesantiago
2 replies
1d4h

Do you and the GP have 350K GPUs and quality data to reproduce 1:1 whatever Facebook releases in their repos?

Even if you want to reproduce the model and they give you the data, you would need to do this at Facebook scale, so you and the GP are just making moot points all around.

https://about.fb.com/news/2023/05/metas-infrastructure-for-a...

https://www.theregister.com/2024/01/20/metas_ai_plans/

The fact that these models are coming from Meta in the open, rather than from Google, which releases only papers with no model, tells me that Meta's models are open enough for everyone to use.

Besides, everyone using the Pytorch framework benefits Meta in the same way they were originally founded as a company:

Network effects

It's relevant.

xigoi
0 replies
10h10m

If you don’t have a machine powerful enough to compile some heavy AAA game, does that make the game open source?

Philpax
0 replies
1d4h

There are organisations that are capable of reproduction (e.g. EleutherAI), but yes, you're right, not having the data is largely irrelevant for most users.

The thing that bothers me more is that it's not actually an open-source licence; there are restrictions on what you can do with it, and whatever you do with the model is subject to those restrictions. It's still very useful and I'm not opposed to them releasing it under that licence (they need to recoup the costs somehow), but "open-source" (or even "open") it is not.

make3
0 replies
1d5h

With PyTorch (& so many open publications), Meta has had an unimaginably strong contribution to AI for a while.

cm2012
0 replies
1d4h

Cambridge Analytica is not a real scandal (did not affect any elections), and FB did not cause a genocide in Myanmar (they were a popular message board during a genocide, which is not the same thing)

WhackyIdeas
0 replies
1d5h

It does seem like the nicest thing Facebook have ever done, giving so much to the open source LLM scene. I know it might have been started by a leaker, but they have given so much voluntarily. I mean, don't get me wrong, I don't like the company, but I do really like some of the choices they have made recently.

But I do wonder in the back of my mind why. I should be suspicious of their angle, and I will keep thinking about it. Is it paranoid to think that maybe their angle is embedding some kind of metadata, with the style of generated code being unique to different machines, so that they can trace generated code back to different people? Is that their angle, or am I biased in remembering who they have been for the past decade?

LVB
29 replies
1d3h

I'm not very plugged into how to use these models, but I do love and pay for both ChatGPT and GitHub Copilot. How does one take a model like this (or a smaller version) and leverage it in VS Code? There's a dizzying array of GPT wrapper extensions for VS Code, many of which either seem like kind of junk (10 d/ls, no updates in a year), or just lead to another paid plan, at which point I might as well just keep my GH Copilot. Curious what others are doing here for Copilot-esque code completion without Copilot.

petercooper
9 replies
1d3h

https://continue.dev/ is a good place to start.

israrkhan
3 replies
1d2h

This looks really good..

dan_can_code
2 replies
1d1h

It's great. It's super easy to install ollama locally, `ollama run <preferred model>`, change the continue config to point to it, and it just works. It even has an offline option by disabling telemetry.

madeofpalk
1 replies
23h0m

Windows coming soon

ugh, not so easy.

dan_can_code
0 replies
17h4m

Yes, that is certainly a downside I forgot to mention. Sorry to get your hopes up.

speedgoose
2 replies
1d3h

Continue doesn’t support tab completion like Copilot yet.

A pull/merge request is being worked on: https://github.com/continuedev/continue/pull/758

sestinj
1 replies
1d1h

Release coming later this week!

sa-code
0 replies
11h55m

Thank you for your efforts

sestinj
0 replies
1d3h

beat me to the punch : )

jondwillis
0 replies
1d2h

Bonus points for being able to use local models!

sestinj
6 replies
1d3h

I’ve been working on continue.dev, which is completely free to use with your own Ollama instance / TogetherAI key, or for a while with ours.

Was testing with Codellama-70b this morning and it’s clearly a step up from other OS models

dan_can_code
5 replies
1d1h

How do you test a 70B model locally? I've tried to query, but the response is super slow.

sestinj
3 replies
1d1h

Personally I was testing with TogetherAI because I don't have the specs for a local 70b. Using quantized versions helps (Ollama's downloads 4-bit by default, you can get down to 2), but it would still require a higher-end Mac. Highly recommend Together, it runs quite quickly and is $0.9/million tokens

d_sc
1 replies
10h49m

are there any docs on setting up togetherAI with continue.dev? would be interested in checking that out as an alternative to OpenAI for experimenting with larger models that won't run/run well on a m1 max.

sestinj
0 replies
3h32m

Definitely, here is a brief reference page for the Together provider: https://continue.dev/docs/reference/Model%20Providers/togeth..., and a higher-level explanation of model configuration here: https://continue.dev/docs/model-setup/overview

vwkd
0 replies
41m

What's the advantage of Together? The price is about the same as GPT-3.5 Turbo ($1/mil tokens is $0.001/thousand tokens), which has the advantage of a wide ecosystem and support.

caeril
0 replies
9h43m

Yeah, CPU inference is incredibly slow, especially as the context grows. 4-bit quantized on an A6000 should in theory work.

If those rent-seeking bastards at NVidia hadn't killed NVL on the 4090, you could do it on two linked 4090s for only $4k, but we have to live under the thumb of monopolists until such time as AMD 1. catches up on hardware and 2. fixes their software support.

apapapa
3 replies
1d

Free Bard is better than free ChatGPT... Not sure about paid versions

ignoramous
1 replies
1d

Bard censorship is annoying. One thing I've found (free) Bard to be better than the rest at is summarizing book chapters, manuals, and docs. It is also surprisingly good at translation (X to English), as it often adds context to what it's translating.

With careful prompt engineering, you can get a lot out of free Bard except when its censored.

apapapa
0 replies
23h32m

Not sure why you talk about Bard's censorship and not ChatGPT's, because in my experience it is much worse.

dxxvi
0 replies
19h5m

I'm learning Rust. It seems to me that Bard is better than the OpenAI GPT-4 I use for free at work.

SparkyMcUnicorn
3 replies
1d3h

There are some projects that let you run a self-hosted Copilot server, then you set a proxy for the official Copilot extension.

https://github.com/fauxpilot/fauxpilot

https://github.com/danielgross/localpilot

water-data-dude
1 replies
1d2h

When I was setting up a local LLM to play with I stood up my own Open AI API compatible server using llama-cpp-python. I installed the Copilot extension and set OverrideProxyUrl in the advanced settings to point to my local server, but CoPilot obstinately refused to let me do anything until I’d signed in to GitHub to prove that I had a subscription.

I don’t _believe_ that either of these lets you bypass that restriction (although I’d love to be proven wrong), so if you don’t want to sign up for a subscription you’ll need to use something like Continue.
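
For anyone else going the llama-cpp-python route: its bundled server speaks the OpenAI wire format, so a plain HTTP call works without any SDK. A minimal sketch, assuming the default port 8000 and the /v1/chat/completions route (verify against your install):

  # Minimal sketch of hitting a local OpenAI-compatible server (e.g. the one
  # bundled with llama-cpp-python). Port and route assumed to be the defaults.
  import json
  import urllib.request

  payload = {
      "model": "local-model",   # name is largely ignored by single-model servers
      "messages": [{"role": "user", "content": "Write a Rust hello world."}],
      "max_tokens": 256,
  }
  req = urllib.request.Request(
      "http://localhost:8000/v1/chat/completions",
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      reply = json.load(resp)
  print(reply["choices"][0]["message"]["content"])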

siilats
0 replies
1d

This plugin just installs in Jetbrains IDE and sets up a local llama.cpp server https://plugins.jetbrains.com/plugin/21056-codegpt

israrkhan
0 replies
1d2h

I tried FauxPilot to make it work with my own llama.cpp instance, but it didn't work out of the box. Filed a GitHub issue, but did not get any traction, and eventually gave up on it. This was around 5 months ago; things might have improved by now.

raxxorraxor
0 replies
15h25m

I use the plugin Twinny in conjunction with ollama to host the models. Easy setup and quite powerful assistance. You need a decent rig though, since you don't want any latency for features like autocomplete.

But even if you don't have a faster rig, you can still leverage it for slower tasks to generate docs or tests.

Twinny should really be more popular, didn't find a more powerful no-bullshit plugin for VSCode.

marinhero
0 replies
1d1h

You can download it and run it with [this](https://github.com/oobabooga/text-generation-webui). There's an API mode that you could leverage from your VS Code extension.

cmgriffing
0 replies
1d

I've been using Cody by Sourcegraph and liking it so far.

https://sourcegraph.com/cody

ado__dev
0 replies
1d

You can try it with Sourcegraph Cody. https://sourcegraph.com/cody

And instructions on how to change the provider to use Ollama w/ whatever model you want:

- Install and run Ollama. Put ollama in your $PATH, e.g. `ln -s ./ollama /usr/local/bin/ollama`.

- Download Code Llama 70b: ollama pull codellama:70b

- Update Cody's VS Code settings to use the unstable-ollama autocomplete provider.

- Confirm Cody uses Ollama by looking at the Cody output channel or the autocomplete trace view (in the command palette).

- Update the cody settings to use "codellama:70b" as the ollama model

https://github.com/sourcegraph/cody/pull/2635

b33j0r
21 replies
1d5h

There is a bait and switch going on, and Sam Altman or Mark Zuckerberg are the first to tell you.

“No one can compete with us, but it’s cute to try! Make applications though” —almost direct quote from Sam Altman.

I have 64GB and an RTX 3090 and a MacBook M3, and I already can’t run a lot of the newest models, even in their quantized form.

The business model requires this to be a subscription service. At least as of today…

kuczmama
14 replies
1d5h

Realistically, what hardware would be required to run this? I assumed a RTX 3090 would be enough?

kkzz99
11 replies
1d5h

The RTX 3090 has 24GB of memory; a quantized Llama 70B takes around 60GB. You can offload a few layers onto the GPU, but most of them will run on the CPU at terrible speeds.

nullc
9 replies
1d4h

You're not required to put the whole model in a single GPU.

You can buy a 24GB gpu for $150-ish (P40).

kuczmama
8 replies
1d4h

Wow that's a really good idea. I could potentially buy 4 Nvidia P40's for the same price as a 3090 and run inference on pretty much any model I want.

frognumber
3 replies
22h38m

For reference for readers.

SUPPORTED

=========

* Ada / Hopper / A4xxx (but not A4000)

* Ampere / A3xxx

* Turing / Quadro RTX / GTX 16xx / RTX 20XX / Volta / Tesla

EOL 2023/2024

=============

* Pascal / Quadro P / Geforce GTX 10XX / Tesla

Unsupported

===========

* Maxwell

* Kepler

* Fermi

* Tesla (yes, this one pops up over and over, chaotically)

* Curie

Older generations don't really do GPGPU much. The older cards are also quite slow relative to modern ones! A lot of the ancient workstation cards can run big models cheaply, but (1) with incredible software complexity and (2) very slowly, even relative to modern CPUs.

Blender rendering very much isn't ML, but it is a nice, standardized benchmark:

https://opendata.blender.org/

As a point of reference: A P40 has a score of 774 for Blender rendering, and a 4090 has 11,321. There are CPUs ($$$) in the 2000 mark, so about dual P40. It's hard for me to justify a P40-style GPU over something like a 4060Ti 16GB (3800), an Arc a770 16GB (1900), or a 7600XT 16GB (1300). They cost more, but the speed difference is nontrivial, as is the compatibility difference and support life. A lot of work is going into making modern Intel / AMD GPUs supported, while ancient ones are being deprecated.

nullc
2 replies
21h50m

P40 is essentially a faster 1080 with 24GB ram. For many tasks (including LLMs) it's easy to be memory bandwidth bottlenecked and if you are they are more evenly matched. (newer hardware has more bandwidth, sure but not in a cost proportional manner).

I find that my hosts using 9x P40 do inference on 70B models MUCH MUCH faster than e.g. a dual 7763, and cost a lot less... and can also support 200B parameter models!

For the price of a single 4090, which doesn't have enough ram to run anything I'm interested in, I can have slower cards which have cumulatively 15 times the memory and cumulatively 3.5 times the memory bandwidth.

frognumber
1 replies
8h56m

Interesting.

Technically, the P40 is rated at an impressive 347.1 GB/sec memory bandwidth, and the 4060 at a slightly lower 272 GB/sec. For bandwidth-limited workloads, the P40 still wins.

The 4090 is about 3-4x that, but as you point out, is not cost-competitive.

What do you use to fit 9x P40 cards in one machine, supply them with 2-3kW of power, and keep them cooled? The best I've found are older rackmount servers, and the ones I was looking at stopped short of that.

nullc
0 replies
7h26m

I technically have 10, plus a 100GbE card, but due to Nvidia chipset limitations using more than 9 in a single task is a pain. (Also, IIRC one of the slots is only 8x in my chassis.)

Supermicro has made a couple of 5U chassis that will take 10x double-width cards and provide adequate power and cooling. The SYS-521GE-TNRT is one such example (I'm not sure off the top of my head which mine are, they're not labeled on the chassis, but they may be that).

They're pricey new, but they show up on eBay for $1-2k. The last ones I bought I paid $1800, and I think for the earlier set I paid $1500 each -- around the time that Ethereum GPU mining ended (I have no clue why someone was using chassis like these for GPU mining, but I'm glad to have benefited!).

eurekin
3 replies
1d4h

Just make sure you're comfortable with manually compiling bitsandbytes and generally combining a software stack of almost-out-of-date libraries.

kuczmama
1 replies
1d4h

That's a good point. Are you referring to the out-of-date CUDA libraries?

eurekin
0 replies
1d3h

I don't remember exactly (either CUDA directly or the cuDNN version used by FlashAttention)... Anyway, /r/LocalLLaMA has a few instances of such builds. It might be really worthwhile looking that up before buying.

nullc
0 replies
1d1h

The P40 still works with CUDA 12.2 at the moment. I used to use K80s (which I think I paid like $50 for!), which turned into a huge mess of dealing with older libraries, especially since essentially all ML stuff is on a crazy upgrade cadence with everything constantly breaking, even without having to deal with orphaned old software.

You can get GPU server chassis that have 10 PCIe slots too, for around $2k on eBay. But note that there is a hardware limitation on the PCIe cards such that each card can only directly communicate with 8 others at a time. Beware, they're LOUD even by the standards of server hardware.

Oh, also the Nvidia Tesla power connectors have CPU-connector-like polarity instead of PCIe, so at least in my chassis I needed to adapt them.

Also keep in mind that if you aren't using a special GPU chassis, the Tesla cards don't have fans, so you have to provide cooling.

trentnelson
0 replies
1d1h

Can that be split across multiple GPUs? i.e. what if I have 4xV100-DGXS-32GBs?

summarity
0 replies
1d5h

A Mac Studio with an M1 Ultra (about 2,800 USD used) is actually a really cost-effective way to run it. Its total system power consumption is really low, even spitting out tokens at full tilt (<250W).

michaelt
0 replies
1d5h

You can run a similarly sized model - Llama 2 70B - at the 'Q4_K_M' quantisation level with 44 GB of memory [1]. So you can just about fit it on 2x RTX 3090 (which you can buy used for around $1,100 each).

Of course, you can buy quite a lot of hosted model API access or cloud GPU time for that money.

[1] https://huggingface.co/TheBloke/Llama-2-70B-GGUF
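
In case it helps anyone going the 2x 3090 route: llama.cpp can spread the layers across both cards with its tensor-split option. A rough sketch (filename and split ratio are illustrative; -ngl 99 just means "offload everything"):

  ./main -m llama-2-70b.Q4_K_M.gguf -ngl 99 --tensor-split 1,1 -p "Explain what a B-tree is"

Splitting by layers like this roughly doubles usable VRAM but not speed, since the cards mostly take turns rather than working in parallel.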

ttul
3 replies
1d5h

A 70B model is quite accessible; just rent a data center GPU hourly. There are easy deployment services that are getting better all the time. Smaller models can be derived from the big ones to run on a MacBook running Apple Silicon. While the compute won’t be a match for Nvidia hardware, a MacBook can pack 128GB of RAM and run enormous models - albeit slowly.

b33j0r
1 replies
1d5h

Ok, well now that we’ve downvoted me below the visibility threshold, I was being sincere. And Altman did say that. I am not a hater.

So. Maybe we could help other people figure out why VRAM is maxing out. I think it has to do with various new platforms leaking memory.

In my case, I suspect ollama and diffusers are not actually evicting VRAM. nvidia-smi shows it in one case, but I haven’t figured it out yet.

Hey, my point remains. The models are going to get too expensive for me, personally, to run locally. I suspect we’ll default into subscriptions to APIs because the upgrade slope is too steep.

b33j0r
0 replies
3h57m

Haha, for posterity, if anyone was interested in the things I mentioned. My specific frustrations came from leaving out torch.no_grad when running inference.

I used copilot to refactor that, and it just didn’t put no_grad back, and I did not notice.

I was uselessly recalculating all of my weights to /dev/null and waste heat.
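
For anyone who hits the same thing: wrapping inference in torch.inference_mode (or torch.no_grad) stops PyTorch from recording the autograd graph and holding onto activations for a backward pass that never happens, which is where the surprise VRAM goes. A minimal sketch, with the model name as a placeholder for whatever you run locally:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  name = "codellama/CodeLlama-7b-hf"  # placeholder; any local causal LM works
  tokenizer = AutoTokenizer.from_pretrained(name)
  model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()
  model.eval()

  inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda")

  # Without this context manager, every forward pass builds an autograd graph
  # and keeps activations in VRAM for gradients that will never be computed.
  with torch.inference_mode():
      logits = model(**inputs).logits
      next_token_id = logits[0, -1].argmax()

  print(tokenizer.decode([int(next_token_id)]))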

I daemonized my process to control it more tightly, but I still see way above expected vram allocation.

Just keep in mind. It’s not like the models are going to get smaller, past the quantization limit. What, is there a quantization of retrievable information to 0 bits? ;)

moyix
0 replies
1d5h

My Macbook has a mere 64GB and that's plenty to run 70B models at 4-bit :) LM Studio is very nice for this.

whimsicalism
0 replies
1d5h

the connection to the classic bait-and-switch seems tenuous at best

YetAnotherNick
0 replies
1d5h

At least as of today…

This is the exact opposite of a bait and switch. The current model can't be un-open-sourced, and over time it will only become easier to run.

Also, unless there is reason to believe that prompt engineering differs greatly between model families (which I honestly don't believe), there is no baiting effect. I believe it will always be the case that the best 2-3 models will be closed-weights.

theLiminator
18 replies
1d4h

Curious what's the current SOTA local copilot model? Are there any extensions in vscode that give you a similar experience? I'd love something more powerful than copilot for local use (I have a 4090, so I should be able to run a decent number of models).

Eisenstein
13 replies
1d4h

When this 70B model gets quantized you should be able to run it fine on your 4090. Check out 'TheBloke' on Hugging Face, and use llama.cpp to run the GGUF files.
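
For example, pulling one of TheBloke's quantized files with the Hugging Face CLI looks roughly like this (the repo and filename follow TheBloke's usual naming, but check the model card for the exact file names):

  huggingface-cli download TheBloke/CodeLlama-70B-Instruct-GGUF codellama-70b-instruct.Q4_K_M.gguf --local-dir .

Then point llama.cpp (or a frontend like LM Studio) at the downloaded GGUF. As the reply below notes, though, even the 4-bit file is bigger than 24GB of VRAM, so expect to offload only part of it.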

coder543
12 replies
1d3h

I think your take is a bit optimistic. I like quantization as much as the next person, but even the 2-bit model won’t fit entirely on a 4090: https://huggingface.co/TheBloke/Llama-2-70B-GGUF

I would be uncomfortable recommending less than 4-bit quantization on a non-MoE model, which is ~40GB on a 70B model.
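
The back-of-the-envelope math, for anyone curious. A rough sketch, assuming a Q4_K_M-style quant works out to a bit under 5 bits per weight once quantization scales and the higher-precision tensors are counted:

  params = 70e9            # 70B parameters
  bits_per_weight = 4.8    # rough effective rate for a Q4_K_M-style quant
  gib = params * bits_per_weight / 8 / 2**30
  print(f"~{gib:.0f} GiB")  # ~39 GiB for weights, before KV cache and runtime overhead

Which lines up with the ~40GB figure, and is why a single 24GB card can't hold it.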

Eisenstein
6 replies
1d3h

The great thing about gguf is that it will cross to system RAM if there isn't enough VRAM. It will be slower, but waiting a couple minutes for a prompt response isn't the worst thing if you are the type that would get use out of a local 70b parameter model. Then again, one could have grabbed 2x 3090s for the price of a 4090 and ended up with 48gb of VRAM in exchange for a very tolerable performance hit.

coder543
2 replies
1d3h

The great thing about gguf is that it will cross to system RAM if there isn't enough VRAM.

No… that’s not such a great thing. Helpful in a pinch, but if you’re not running at least 70% of your layers on the GPU, then you barely get any benefit from the GPU in my experience. The vast gulf in performance between the CPU and GPU means that the GPU is just spinning its wheels waiting on the CPU. Running half of a model on the GPU is not useful.

Then again, one could have grabbed 2x 3090s for the price of a 4090 and ended up with 48gb of VRAM in exchange for a very tolerable performance hit.

I agree with this, if someone has a desktop that can fit two GPUs.

zten
0 replies
1d

Multi-GPU in desktop chassis gets crazy pretty quickly. If you don't care about aesthetics and can figure out both the power delivery and PCI-E lane situation, https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni... has an example that will make a Bitcoin mining rig look clean.

Water cooling can get you down to 2x slot height, with all of the trouble involved in water cooling. NVIDIA really segmented the market quite well. Gamers hate blower cards, but they are the right physical dimensions to make multi-GPU work well, and they are exclusively on the workstation cards.

sp332
0 replies
1d1h

The main benefit of a GPU in that case is much faster prompt reading. Could be useful for Code Llama cases where you want the model to read a lot of code and then write a line or part of a line.

int_19h
1 replies
1d1h

GGUF is just a file format. The ability to offload some layers to CPU is not specific to it nor to llama.cpp in general - indeed, it was available before llama.cpp was even a thing.

Eisenstein
0 replies
1d

I'm pretty sure I didn't assert that it was more than a file format, or that llama.cpp was a pioneer in that regard?

dimask
0 replies
1d1h

The great thing about gguf is that it will cross to system RAM if there isn't enough VRAM.

Then you can just run it entirely on the CPU. There is no point buying an expensive GPU to run LLMs only to be bottlenecked by your CPU in the first place. Which is why I don't get so excited about these huge models: they gain less traction because fewer people can run them locally, and fine-tuning is probably more costly too.

nox101
4 replies
1d3h

Fortunately it will run on my UMA Mac. It's made me curious what the trade-offs are: would I be better off with a 4090, or a Mac with 128+ GB of unified memory?

coder543
2 replies
1d3h

Even the M3 Max seems to be slower than my 3090 for LLMs that fit onto the 3090, but it’s hard to find comprehensive numbers. The primary advantage is that you can spec out more memory with the M3 Max to fit larger models, but with the exception of CodeLlama-70B today, it really seems like the trend is for models to be getting smaller and better, not bigger. Mixtral runs circles around Llama2-70B and arguably ChatGPT-3.5. Mistral-7B often seems fairly close to Llama2-70B.

Microsoft accidentally leaked that ChatGPT-3.5-Turbo is apparently only 20B parameters.

24GB of VRAM is enough to run ~33B parameter models, and enough to run Mixtral (which is a MoE, which makes direct comparisons to “traditional” LLMs a little more confusing.)

I don’t think there’s a clear answer of what hardware someone should get. It depends. Should you give up performance on the models most people run locally in hopes of running very large models, or give up the ability to run very large models in favor of prioritizing performance on the models that are popular and proven today?

int_19h
1 replies
1d1h

M3 Max is actually less than ideal because it peaks at 400 GB/s for memory. What you really want is an M1 or M2 Ultra, which offers up to 800 GB/s (for comparison, the RTX 3090 runs at 936 GB/s). A Mac Studio suitable for running 70B models at speeds fast enough for realtime chat can be had for ~$3K.

The downside of Apple's hardware at the moment is that the training ecosystem is very much focused on CUDA; llama.cpp has an open issue about Metal-accelerated training: https://github.com/ggerganov/llama.cpp/issues/3799 - but no work on it so far. This is likely because training at any significant sizes requires enough juice that it's pretty much always better to do it in the cloud currently, where, again, CUDA is the well-established ecosystem, and it's cheaper and easier for datacenter operators to scale. But, in principle, much faster training on Apple hardware should be possible, and eventually someone will get it done.

coder543
0 replies
19h48m

Yep, I seriously considered a Mac Studio a few months ago when I was putting together an “AI server” for home usage, but I had my old 3090 just sitting around, and I was ready to upgrade the CPU on my gaming desktop… so then I had that desktop’s previous CPU. I just had too many parts already, and it deeply annoys me that Apple won’t put standard, user-upgradable NVMe SSDs on their desktops. Otherwise, the Mac Studio is a very appealing option for sure.

zten
0 replies
1d2h

Well, the workstation-class equivalent of a 4090 -- RTX 6000 Ada -- has enough RAM to work with a quantized model, but it'll blow away anyone's budget at anywhere between $7,000 and $10,000.

sfsylvester
3 replies
1d3h

This is a completely fair but open-ended question. Not to be a typical HN user, but when you say SOTA local, the real question is which benchmarks you care about for evaluation: size, operability, complexity, explainability, etc.

Working out which copilot models perform best has been a deep exercise for me, and it has really made me evaluate my own coding style - what I find important, and what I look out for when investigating models and evaluating interview candidates.

I think the three benchmarks & leaderboards most people go to are:

https://huggingface.co/spaces/bigcode/bigcode-models-leaderb... - the most widely understood, broad language-capability leaderboard, relying on well-understood evaluations and benchmarks.

https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul... - Also comprehensive, but primarily assesses Python and JavaScript.

https://evalplus.github.io/leaderboard.html - which I think is a better take on comparing models you intend to run locally as you can evaluate performance, operability and size in one visualisation.

Best of luck and I would love to know which models & benchmarks you choose and why.

vwkd
0 replies
34m

when investigating models and evaluating interview candidates

Wow, just realized, in the future employers will mostly interview LLMs instead of people.

theLiminator
0 replies
1d3h

I'm honestly more interested in anecdotes and I'm just seeking anything that can be a drop-in copilot replacement (that's subjectively better). Perhaps one major thing I'd look for is improved understanding of the code in my own workspace.

I honestly don't know what benchmarks to look at or even what questions to be asking.

hackerlight
0 replies
1d2h

https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul... - Also comprehensive, but primarily assesses Python and JavaScript.

I wonder why they didn't use DeepSeek under the "senior" interview test. I am curious to see how it stacks up there.

edweis
13 replies
1d5h

How come a company as big as Meta still uses bit.ly?

nemothekid
7 replies
1d5h

What else would they use?

3pt14159
3 replies
1d5h

Something like meta.com/our_own_tech_handles_this

nemothekid
2 replies
1d5h

Not sure it's preferable to hire people at FB salaries to maintain a link shortener rather than just use a reputable free one?

xigoi
0 replies
10h12m

They already have fb.com.

junon
0 replies
1d4h

Every big company has one of these anyway, and usually more involved (internal DNS, VPN, etc). A link shortener is like an interview question.

transcriptase
0 replies
1d5h

fb.com seems like a reasonable choice.

huac
0 replies
1d5h

their own shortener, e.g. fb.me, presumably

Cthulhu_
0 replies
1d5h

Their own?

geor9e
1 replies
1d4h

Ironically it doesn't help to use link shorteners on twitter anyway - all URLs posted to twitter count as 23 characters. The hypertext is the truncated original URL string, and the URL is actually a t.co link.

esafak
0 replies
22h46m

Shorteners these days are for analytics, not shortening per se.

smcleod
0 replies
1d

Yeah those links don't even work for me anymore as it's a tracking site.

kmeisthax
0 replies
1d2h

Not only that, the announcement is on Twitter, a company that at least used to be their biggest competitor. Old habits die hard, huh?

esafak
0 replies
1d3h

Because this is a marketing channel. They handle tracking of FB/IG messages by other means, intended for engineers.

turnsout
12 replies
1d6h

Given how good some of the smaller code models are (such as Deepseek Coder at 6.7B), I'll be curious to see what this 70B model is capable of!

jasonjmcghee
4 replies
1d6h

My personal experience is that Deepseek far exceeds code llama of the same size, but it was released quite a while ago.

turnsout
3 replies
1d5h

Agreed—I hope Meta studied Deepseek's approach. The idea of a Deepseek Coder at 70B would be exciting.

whimsicalism
1 replies
1d5h

The “approach” is likely just training on more tokens.

imjonse
0 replies
1d5h

It's the quality of the data and the method of training, not just the number of tokens (per their paper released a few days ago).

hackerlight
0 replies
1d4h

There's a deepseek coder around 30-35b and it has almost identical performance to the 7b on benchmarks.

ignoramous
4 replies
1d5h

AlphaCodium is the newest kid on the block that's SoTA pass@5 on coding tasks (authors claim at least 2x better than GPT4): https://github.com/Codium-ai/AlphaCodium

As for small models, Microsoft has been making noise with the unreleased WaveCoder-Ultra-6.7b (https://arxiv.org/abs/2312.14187).

eurekin
1 replies
1d5h

Are weights available?

moyix
0 replies
1d5h

AlphaCodium is more of a prompt engineering / flow engineering strategy, so it can be used with existing models.

passion__desire
0 replies
1d5h

AlphaCodium author says he should have used DSPy

https://twitter.com/talrid23/status/1751663363216580857

hackerlight
0 replies
1d4h

Is this better than GPT4's Grimoire?

CrypticShift
0 replies
1d5h

Phind [1] uses the larger 34B Model. Still, I'm also curious what they are gonna do with this one.

[1] https://news.ycombinator.com/item?id=38088538

chrishare
6 replies
1d2h

Credit where credit is due, Meta has had a fantastic commitment towards open source ML. You love to see it.

mvkel
2 replies
20h33m

Wasn't LLaMa originally a leak that they were then forced to spin into an open source contribution?

Not to diminish the value of the contribution, but "commitment" is an interesting word choice.

hnfong
1 replies
17h12m

It wasn't a leak in the typical sense. They sent the weights to pretty much everyone who asked nicely.

When you send something interesting to thousands of people without vetting their credentials, you'd expect the stuff to get "leaked" out eventually (and sooner rather than later).

I'd say it's more appropriate to say the weights were "pirated" than "leaked".

That said, you're probably correct that the community that quickly formed around the "pirated" weights might have influenced Zuckerberg to make Llama 2's weights more freely accessible.

raxxorraxor
0 replies
15h12m

If image synthesis is any example, I believe this to be a winning strategy in the long run for getting the most competent model. The LLM space just moves a bit slower, since the entry requirements for compute power are that much higher and data preparation is maybe a bit more "dry" than tagging images.

satvikpendem
0 replies
13h27m

Meta has source available licenses, not open source ones.

joshspankit
0 replies
1d

Yes but: if the commitment is driven by internal researchers and coders standing firm about making their work open source (a rumour I’ve heard a couple times), most of the credit goes to them.

Aissen
0 replies
6h52m

Redistributable and free to use weights does not make a model open source (even if it's really nice, with very few people having access to that kind of training power).

ramshanker
5 replies
1d5h

Are these trained on internal codebases or just public repositories?

eigenvalue
1 replies
1d5h

Would be a really bad idea to train on internal code I would think. Besides, there is no shortage of open source code (even open source created by Meta) out there.

changoplatanero
0 replies
1d5h

Correct that it’s a bad idea to train on internal code. However surprisingly there is a shortage of open source code. These models are trained on substantially all the available open source code that these companies can get their hands on.

make3
0 replies
1d5h

People are able to extract training data from LLMs with different methods, so there's no way this was trained on internal code.

andy99
0 replies
1d5h

The github [0] hasn't been fully updated, but it links to a paper [1] that describes how the smaller code llama models were trained. It would be a good guess that this model is similar.

[0] https://github.com/facebookresearch/codellama [1] https://ai.meta.com/research/publications/code-llama-open-fo...

albertzeyer
0 replies
1d5h

Without knowing any details, I'm almost sure that they did not train it on internal code, as it might be possible to reveal that code otherwise (given the right prompt, or just looking at the weights).

simonw
4 replies
1d5h

Here's the model on Hugging Face: https://huggingface.co/codellama/CodeLlama-70b-hf

israrkhan
3 replies
1d2h

I hope someone will soon post a quantized version that I can run on my macbook pro.

annjose
2 replies
15h30m

Ollama has released the quantized version.

https://ollama.ai/library/codellama:70b https://x.com/ollama/status/1752034686615048367?s=20

Just need to run `ollama run codellama:70b` - pretty fast on macbook.

theLiminator
0 replies
4h31m

Do you know how much vram is required?

sciencesama
0 replies
7h13m

ollama run codellama:70b

pulling manifest
pulling 1436d66b6... 1.1 GB / 38 GB, 24 MB/s, 25m21s

pandominium
4 replies
1d1h

Everyone is mentioning using a 4090 and a smaller model, but I rarely see an analysis that factors in energy consumption.

I think Copilot is already highly subsidized by Microsoft.

Let's say you use Copilot around 30% of your daily work hours. How many kWh does an open-source 7B or 13B model then use in a month on one 4090?

EDIT:

I think for a 13B at 30% use per day it comes to around $30/mo on the energy bill.

So an even smaller but still capable model can probably beat the Copilot monthly subscription.
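
A rough sketch of the arithmetic, where every number is an assumption (GPU draw while generating, duty cycle, workdays, and electricity price all vary):

  gpu_watts = 350           # assumed 4090 draw while actually generating
  hours_active = 8 * 0.30   # 30% of an 8-hour workday
  days = 22                 # working days per month
  price_per_kwh = 0.40      # varies a lot by region

  kwh = gpu_watts / 1000 * hours_active * days
  print(f"{kwh:.1f} kWh/month, ~{kwh * price_per_kwh:.2f} per month")

The duty-cycle assumption (whether the GPU is actually generating for all of that time or only in short bursts) moves the result between a few dollars and a few tens of dollars a month.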

fennecfoxy
0 replies
13h30m

You don't even need to do hard math: compare using a Copilot-style LLM (bursts of 100% GPU every once in a while) vs gaming on your 4090 (running at 100% for x hours).

Retric
0 replies
1d

Subscription models are generally subsidized by people barely using them. So I wouldn’t be surprised if the average is closer to 10%.

MacsHeadroom
0 replies
11h31m

Using a model for 30% of the day is really only maybe 100 instances of use, each lasting about 6 seconds.

So really you're looking at using the GPU for around 10 minutes a day.

Monthly cost is pennies.

Lapha
0 replies
22h17m

Running models locally using GPU inference shouldn't be too bad as the biggest impact in terms of performance is ram/vram bandwidth rather than compute. Some rough power figures for a dual AMD GPU setup (24gb vram total) on a 5950x (base power usage of around 100w) using llama.cpp (i.e., a ChatGPT style interface, not Copilot):

46b Mixtral q4 (26.5 gb required) with around 75% in vram: 15 tokens/s - 300w at the wall, nvtop reporting GPU power usage of 70w/30w, 0.37kWh

46b Mixtral q2 (16.1 gb required) with 100% in vram: 30 tokens/s - 350w, nvtop 150w/50w, 0.21kWh.

Same test with 0% in vram: 7 tokens/s - 250w, 0.65kWh

7b Mistral q8 (7.2gb required) with 100% in vram: 45 tokens/s - 300w, nvtop 170w, 0.12kWh

The kWh figures are an estimate for generating 64k tokens (around 35 minutes at 30 tokens/s), it's not an ideal estimate as it only assumes generation and ignores the overhead of prompt processing or having longer contexts in general.

The power usage essentially mirrors token generation speed, which shouldn't be too surprising. The more of the model you can load into fast VRAM, the faster tokens will generate and the less power you'll use for the same number of tokens generated. Also note that I'm using mid- and low-tier AMD cards, with the mid-tier card being used for the 7B test.

If you have an Nvidia card with fast memory bandwidth (i.e., a 3090/4090), or an Apple ARM Ultra, you're going to see in the region of 60 tokens/s for the 7B model. With a mid-range Nvidia card (any of the 4070s), or an Apple ARM Max, you can probably expect similar performance on 7B models (45 t/s or so). Apple ARM probably wins purely on total power usage, but you're also going to be paying an arm and a leg for a 64GB model, which is the minimum you'd want to run medium/large models with reasonable quants (46B Mixtral at q6/8, or 70B at q6), though with the rate models are advancing you may be able to get away with 32GB (Mixtral at q4/6, 34B at q6, 70B at q3).

I'm not sure how many tokens a Copilot-style interface is going to churn through, but it's probably in the same ballpark. A reasonable figure for either interface at the high end is probably a kWh a day, and even in expensive regions like Europe it's probably no more than $15/mo. The actual cost comparison then becomes a little complicated: spending $1500 on two 3090s for 48GB of fast VRAM isn't going to make sense for most people, and similarly, making do with whatever cards you can get your hands on (so long as they have a reasonable amount of VRAM) probably isn't going to pay off in the long run. It also depends on the size of the model you want to use and how much quantisation you're willing to put up with. Current 34B models or Mixtral at reasonable quants (q4 at least) should be comparable to ChatGPT 3.5; future local models may get better performance (either in terms of generation speed or how smart they are), but ChatGPT 5 may blow everything we have now out of the water. It seems far too early to make purchasing decisions based on what may happen, but most people should be able to run 7B/13B and maybe up to 34B/46B models with what they have and not break the bank when it comes time to pay the power bill.

ahmednazir
2 replies
1d4h

Can you explain why big tech companies are racing to release open-source models? If a model is free and open source, then how will they earn money, and how will they compete with others?

stainablesteel
0 replies
1d3h

they want to incentivize dependency

jampekka
0 replies
1d4h

Commoditize your complement?

siilats
1 replies
1d

We made a Jetbrains plugin called CodeGPT to run this locally https://plugins.jetbrains.com/plugin/21056-codegpt

bredren
0 replies
23h56m

Are seamless conversations still handled, using the truncation method described in #68?

I was curious if some kind of summary or compression of old exchanges flagged as such might allow the app to remember stuff that had been discussed but fallen outside the token limit.

But possibly request key details lost during summary to bring them back into the new context.

I had thought chatgpt was doing something like this but haven’t read about it.

doctoboggan
1 replies
1d3h

Is there a quantized version available for ollama or is it too early for that?

coder543
0 replies
1d2h

Already there, it looks like: https://ollama.ai/library/codellama

(Look at “tags” to see the different quantizations)

anonymousDan
1 replies
1d1h

Can anyone tell me what kind of hardware setup would be needed to fine-tune something like this? Would you need a cluster of GPUs? What kind of size + GPU spec would you think is reasonable (e.g. wrt VRAM per GPU, etc.)?

a_wild_dandan
0 replies
16h47m

I just use my M2 MacBook Pro. Works great on big models.

sidcool
0 replies
12h1m

I am trying this on perplexity labs. How does it work so fast?

robin-whg
0 replies
1d3h

This is a prime example of the positive aspects of capitalism. Meta has its own interests, of course, but as a side effect, this greatly benefits consumers.

martingoodson
0 replies
1d4h

Baptiste Roziere gave a great talk about Code Llama at our meetup recently: https://m.youtube.com/watch?v=_mhMi-7ONWQ

I highly recommend watching it.

fullspectrumdev
0 replies
1d

This looks potentially interesting if it can be ran locally on say, an M2 Max or similar - and if there’s an IDE plugin to do the Copilot thing.

Anything that saves me time writing “boilerplate” or figuring out the boring problems on projects is welcome - so I can expend the organic compute cycles on solving the more difficult software engineering tasks :)

dhess
0 replies
1d

Can anyone recommend an Emacs mode for this model?

d_sc
0 replies
10h48m

Hope the M3 Ultra Studio comes along soon. Would be great to get a cpu/gpu of that level + the ram capacity it would likely be able to handle.

d_sc
0 replies
8h28m

Any good resources or suggestions for system/pre-prompts for general coding or when targeting a specific language? I.e., when using CodeLlama and working in TypeScript, Ruby, Rust, Elixir, etc., is there a universal prompt that gives good results, or would you want to adjust the prompt depending on the language you're targeting?

Ninjinka
0 replies
1d5h

Benchmarks?