
Fine tune a 70B language model at home

vouaobrasil
22 replies
6h53m

It would be great if we were a bit more respectful towards our natural resources. Using so much energy to play with language models is a waste of resources.

nl
3 replies
6h41m

They are using gaming GPUs. If you want to complain about a waste of natural resources there seems to be a lot of people playing games...

yard2010
1 replies
4h50m

Video games are like the opposite of waste. This planet can go to hell if I can't enjoy art.

Sohcahtoa82
0 replies
33m

There are some people that consider all forms of entertainment that don't create some form of extrinsic value to be a waste of time, energy, and materials.

I feel bad for them. They're going to be the ones that lay in their death bed thinking "I wish I had allowed myself to have more fun".

vouaobrasil
0 replies
6h24m

Well, they serve the same function. Modern consumerist society removes almost all real autonomy from people and makes them do fairly meaningless tasks (most jobs). So, it's rather expected that we need to seek out greater and greater amusements (gaming and playing with ridiculous models) so we're kept in a somewhat happy state, instead of realizing the trick of the system that will one day come crashing down due to its unsustainability.

nraford
2 replies
6h43m

It would be great if I had a bathtub full of ice cream as well, and if we all lived in a world overflowing with love, respect and joy for all living things. Until then, I'm happy that these kinds of incredible tools are (and increasingly will be) in more of our hands for close to free. Upwards and onwards!

vouaobrasil
1 replies
6h26m

Seems like with every passing year, we are going downwards, not upwards. Perhaps it only seems the other way around to those with the greatest addictions to technology, who will justify any development to satisfy their cravings.

guiriduro
0 replies
3h37m

Well, I for one am happy that less compute is being wasted on blockchain, and if the total BTUs and tonnes of CO2 remain equal while the proportion allocated to AI goes up, that'll also be a good thing. Doing useful stuff, and becoming more efficient (eliminating high carbon wasteful human activities and replacing with AI compute using less overall carbon), is also a net win.

gkbrk
2 replies
6h0m

> Using so much energy to play with language models is a waste of resources.

Why do you get to decide what's wasteful and what's useful?

We have more energy than we could ever hope to use from the sun and from nuclear. The solution isn't telling people they're wasting precious energy that you would put to better use. Just build more nuclear reactors and more solar.

vouaobrasil
1 replies
5h47m

Why do you get to decide which habitats die and which live by using all this modern technology that relies on mining and kills them?

Applejinx
0 replies
2h42m

I mean, this is a fair point but right now you're not talking to a libertarian who believes the technology inevitably governs itself, to the destruction of all around it.

You're talking to more of a civilization-type who believes you have to use regulation and, if necessary, state violence to stop types of mining that kill habitats, because the technology certainly isn't going to up and decide to do that. It's the job of society to figure this stuff out, arrive at such positions and defend them. There are plenty of good reasons to defend the protection of habitats, even for purely self-interested pragmatic reasons.

firtoz
2 replies
6h46m

The point in the article is to use less resources. So, yes?

vouaobrasil
1 replies
6h23m

People think that by making a system use less resources, the entire use of it on a societal level will be reduced. Unfortunately, we must watch out for more efficiency making more people use it, and potentially increasing the absolute quantity of energy being used.

MacsHeadroom
0 replies
3h5m

Energy usage is good, actually. Energy scarcity and dirty energy are bad. But energy usage is good.

We should strive to use and produce orders of magnitude more (clean) energy.

Cheer2171
2 replies
5h18m

I started reading your blog linked from your profile. I was disappointed that you used so much energy to play with language by writing up thoughts and hosting them on the internet. Seems like a waste of resources. Why do you think you have the right to burn all that carbon just so you can share your thoughts with the world? Why not just save electricity and take a walk through nature? That's how I think you should use your resources. I think I know better than you about how you should use energy. I think you should have to follow my ideology about energy use.

vouaobrasil
1 replies
5h17m

You are right in a way. I hope to one day give up the internet completely...

Applejinx
0 replies
2h39m

Careful, he'll have you disappear in a puff of logic. And then get killed at the next zebra crossing. Do you want that on your conscience? :)

infecto
1 replies
5h22m

I hate this more recent doomer logic. Who is the arbiter of deciding what's a waste and not a waste? Why use such a basic and uncompelling narrative of telling others how they should live their lives? I try to be thoughtful about my purchases and conserve my own resources, and I'm happy to talk about it, but telling people that "doing x is a waste of resources" is a fool's errand. Humanity has always progressed as a collective group; it won't slow down now even if some individuals drop out. I don't know what the future holds, but collectively we will continue to progress, and I see the bright side of all the recent momentum in renewable energy and the focus on making systems more efficient.

Not the first time this character has popped up here on HN.

"I write full-time and try to convince people of the danger of AI and advanced technology."

vouaobrasil
0 replies
5h13m

You are probably right...it may not have been the best approach. Well, I get emotional sometimes about the rapid advancement of technology and I do think it is a mistake of humanity to do so.

yard2010
0 replies
4h53m

Also, whoever is married, please for the love of god merge your facebook accounts. It takes up too much space on the internet.

llmzero
0 replies
6h32m

What is a little contradictory is that designing a system to use fewer resources can increase the number of people fine-tuning models, so the final result can be a net global increase in total energy use. A hypothetical goal could be to reuse fine-tuning, that is, designing a knowledge graph in which you fine-tune from a previously fine-tuned model (like dynamic programming: save the results of previous computations). LoRA allows us to store the small matrices at low cost.
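
As a rough back-of-the-envelope illustration (the layer width and LoRA rank below are made-up numbers, not anything from the article), storing an adapter really is tiny compared with storing the full weights:

    # Back-of-the-envelope: size of a LoRA adapter vs. the full weight matrix.
    # The layer width (8192) and rank (16) are illustrative assumptions.
    hidden = 8192                     # width of one linear layer in a large model
    rank = 16                         # LoRA rank

    full_params = hidden * hidden     # dense weight matrix W
    lora_params = 2 * hidden * rank   # low-rank factors A (hidden x r) and B (r x hidden)

    print(f"full weight:  {full_params:,} params")              # 67,108,864
    print(f"LoRA adapter: {lora_params:,} params")              # 262,144
    print(f"ratio: {full_params / lora_params:.0f}x smaller")   # 256x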

exitb
0 replies
6h5m

Running a powerful GPU at full load on coal-generated energy causes two orders of magnitude lower emissions than flying on an airliner (per passenger). If you've ever flown anywhere in your life, I don't think you can climb onto that high horse.

Applejinx
0 replies
2h47m

Beats crypto in my opinion. I feel like there's a sliding scale for this stuff, and playing with language models is genuinely interesting, though it's harder to predict what the real benefits will be.

I feel certain there will be great benefit, but not in the way AI hypesters expect there to be.

int_19h
19 replies
9h53m

This is great, but one thing I really hoped would come sooner is fast training on Metal. As things are, you can get an M1/M2 Ultra (~800 GB/s memory bandwidth; for comparison, the RTX 4090 is ~1050 GB/s) Mac Studio with 128GB RAM for ~$3500. For large model inference, this is already way more affordable than stacking GPUs while being "fast enough", but training solutions are basically non-existent. I do wonder why; it feels like low-hanging fruit.

buildbot
14 replies
9h49m

Compute limited: an M2 Ultra has ~27 TFLOPS, a 4090 has 80+.

yumraj
8 replies
9h30m

So it should just take longer..

AnthonyMouse
7 replies
8h40m

If you don't care how long it takes you can get an old server with 128GB of RAM for a lot less than $3500.

ErneX
6 replies
7h59m

But that isn't GPU memory right? On the Mac it is.

rthnbgrredf
5 replies
7h8m

The issue here isn't specifically about the classification of memory, be it "unified memory," RAM, or VRAM. The primary concern is ensuring there's enough memory capacity for the models required for inference.

The real question is the Mac's value proposition in terms of inference speed, particularly for models as large as 70 billion parameters. A 4090 GPU can deliver real-time inference, which is the desired outcome for most users. In contrast, a Mac Studio offers close to real-time inference speeds, which might be disappointing for users expecting a real-time experience.

Then there's the option of CPU + RAM-based inference, which suits scenarios where immediate responses aren't crucial, allowing for batch processing of prompts and subsequent retrieval of responses.

Considering that the price points of the Mac Studio and high-end GPUs are relatively comparable, it raises the question of the practicality and value of near-real-time inference in specific use cases.

abtinf
2 replies
6h51m

Hello gpt.

rthnbgrredf
1 replies
4h9m

I'm not a gpt. But now you could say that this is exactly how a gpt would answer and we get stuck in a loop and there's no obvious way to prove that I'm not a gpt.

pbhjpbhj
0 replies
2h16m

'Write me something profane?' That probably weeds out commercially available GPTs?

Lalabadie
1 replies
6h30m

Considering that the topic is approachability and energy efficiency, that Mac Studio will do reasonably fast inference while consuming <200W at full load.

The speed is certainly not comparable to dedicated GPUs, but the power efficiency is ridiculous for a very usable speed and no hardware setup.

Applejinx
0 replies
2h52m

This, and then you get to have a Mac Studio.

I have one, where I selected an M1 Ultra and 128G RAM to facilitate just this sort of thing. But in practice, I'm spending much more time using it to edit 4K video, and as a recording studio/to develop audio plugins on, and to livestream while doing these things.

Turns out it's good at these things, and since I have the LLAMA 70b language model at home and can run it directly unquantized (not at blinding speed, of course, but it'll run just fine), I'm naturally interested in learning how to fine tune it :)

erichocean
4 replies
9h37m

Memory limited: an M2 Ultra has >150 GiB, a 4090 24 GiB.

lawlessone
3 replies
3h4m

So why is nobody doing this?

ttul
2 replies
2h51m

My personal experience with Apple Silicon and machine learning in comparison with Nvidia is that the libraries are often missing various features on Apple, leading to headaches when trying to use various libraries and tools. Apple is working to bridge the gap and I am excited for the gap to be closed because the memory bandwidth on big M2 and M3 machines is monstrous.

lawlessone
1 replies
2h3m

Sounds similar to how people have described game dev for the Mac. The hardware is there. It just isn't supported.

imhoguy
0 replies
1h34m

Apple could single-handedly kill the consumer dGPU market if they released proper low-level APIs for their M1/2/3. I feel they have something huge coming down the pipe to shake up the "AI" market.

sqreept
3 replies
9h50m

M1, M2, and M3 still have a very low number of GPU cores. Apple should release some better hardware to take advantage of their recently released MLX library.

sbinnee
1 replies
6h1m

At this moment it looks clear to me that Apple won't go that way. It's enough for them to focus on inference and actual applications, not the heavy training part. They have probably been training models on a cluster of non-Apple silicon and making them available on their chips only for inference.

ttul
0 replies
2h49m

Not to mention entirely outsourcing training workloads to specialist firms. Apple does a lot of secretive outsourcing of things you might think they would or should do in-house. This contrasts with Google and Meta who seem to like keeping everything in-house.

kergonath
0 replies
2h45m

It’s true that their GPUs are slower than Nvidia’s. But keep in mind that cores are really different and cannot be compared across architectures. You want more Gflops, not necessarily more cores.

artninja1988
14 replies
19h31m

So, as I understand it, this is for fine-tuning a pre-existing LLM, not actually training one from scratch? I guess that would be too much to ask for. Nonetheless, cheers to Jeremy and the gang for the work.

jph00
12 replies
18h45m

For now, it's for finetuning.

The issue of to what degree it might be possible to train a model from scratch using QLoRA is still an open question. The relora paper showed that it can work in some situations, but attempts to scale it up were unsuccessful. The recent DoRA paper perhaps might allow a "re-DoRA" approach to work. If so, that could be combined with quantization to do "re-QDoRA"!

qsi
9 replies
12h11m

The headline and introduction on the linked page say "You can now train a 70b language model at home. We’re releasing an open source system, based on FSDP and QLoRA, that can train a 70b model on two 24GB GPUs."

How does "fine tuning" differ from "training?" Reading the linked article I had assumed I could create my own trained LLM at home with two 24GB GPUs.

jph00
3 replies
10h50m

The article actually sneaks in a footnote that answers this (https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html#fn1): "Throughout this article “training” can refer to either pre-training, or fine-tuning".

(Generally, we've told students at fast.ai since 2017 that they should almost never be starting from random weights -- most of the time it's best to start with a pretrained model and fine-tune that, even if it's from a somewhat different domain to the problem you're working on.)

Tomte
2 replies
8h26m

Have you changed your mind on „The End of Finetuning“ (https://www.latent.space/p/fastai ) or did I simply misunderstand that?

Oh, and thanks for quirky stuff like your APL video!

jph00
1 replies
7h52m

The title of that podcast isn't something I actually said (IIRC). I commented in that interview that I feel we should not consider pre-training and fine-tuning to be as separate as we do now.

Tomte
0 replies
7h45m

So you‘re generally in favor of mixing training data without separating them in phases, but when I use pretrained weights (as you recommend instead of random weights) I generally do not have access to whatever the neural net was pretrained with by someone else, so I have to make do with my finetuning data, yes?

Thank you!

keremturgutlu
2 replies
11h18m

You most definitely can; the main difference is that only a small fraction, roughly 2%, of the parameters gets updated during training. Say you start from a model like Llama-70B, which already knows English and has some world knowledge from its pretraining dataset. It might not be ideal for drastic domain shifts, such as adapting a model to learn new languages (which might require a new tokenizer and new model embeddings), but it still might be possible to some extent.
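
For anyone wondering what that looks like in practice, here is a minimal sketch of a LoRA setup using the Hugging Face peft library; the checkpoint name, rank, and target modules are illustrative choices, not a recipe from this thread or the article:

    # Minimal LoRA setup sketch with Hugging Face peft. The checkpoint, rank,
    # and target modules below are illustrative assumptions.
    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",        # any causal LM checkpoint works here
        torch_dtype=torch.bfloat16,
    )

    config = LoraConfig(
        r=16,                                  # rank of the low-rank update
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],   # which linear layers get adapters
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, config)
    model.print_trainable_parameters()         # typically reports ~1-2% trainable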

qsi
1 replies
11h10m

Thank you for clarifying. I have been wanting to dip my toes into LLMs at home but obviously I have a steep learning curve ahead of me, and would need considerably beefier hardware!

chasd00
0 replies
5h42m

It’s steep but manageable, absolutely go for it. The more people who understand the tech the better.

IanCal
1 replies
11h16m

You can take an existing 70B model and train it to do a more specific task. You're teaching it the task but you're relying on a foundation model for the base understanding of the world/words/etc.

qsi
0 replies
11h10m

OK, that makes sense. Thank you!

buildbot
0 replies
12h49m

Lit-GPT is what I have been using to pretrain models at home: https://github.com/Lightning-AI/litgpt Using the openwebtext example, I can train a 700M param model to 2.6 loss in a few days on dual 4090s. Pretty awesome!

jncfhnb
8 replies
2h50m

So… why do people want to fine tune LLMs at home? It seems very unlikely to provide value.

* you’re probably not going to succeed at injecting new knowledge in a way that feels satisfyingly top of mind to the bot

* you’re probably not going to create such a meaningfully new style that it warrants a Lora like in images

What’s an example use case?

tracerbulletx
2 replies
2h31m

Hmm why would someone on a forum called hacker news want to tinker and explore an exciting new technology. Who can say? One of life’s great mysteries really.

jncfhnb
1 replies
2h7m

I’m curious what they’re trying to do because I’m curious and I don’t see it. You’re the one being a dismissive dick here.

SamPatt
0 replies
33m

> So… why do people want to fine tune LLMs at home? It seems very unlikely to provide value.

Asking the first question is fine, but your follow-up comment sounds more dismissive than curious.

That's probably why the snarky response.

Solvency
2 replies
2h35m

Illicit fan fiction. Whether it's image or text models.

It's ALWAYS illicit fan fiction.

jncfhnb
0 replies
2h8m

I mean I’ve seen the expressive niches on image models of civitai, but do you really need custom fine tuned LLMs for text fanfiction?

Like sure, you need something that is not the friendly question answerer; but do you really need such a broad population as in images to suit your needs? I’m guessing no?

CharlesW
0 replies
2h26m

Consensual sex between Gilligan and the Professor is not a crime.

mttpgn
1 replies
2h21m

I find that available LLMs have difficulty recalling instances in specific works by given authors. For example, if you ask GPT-4 "In which Philip K. Dick novel does the protagonist character consider converting to Judaism and moving to Israel?" it will respond with Dick's best known book _The Man in the High Castle_ and the character Frank Fink. The answer is incorrect. Israel does not exist in the world of that novel; furthermore, the character of Fink already is Jewish. The correct answer is Angel Archer in _The Transmigration of Timothy Archer_.

I have considered the feasibility of fine-tuning an LLM on the writings of a specific author. The idea is that it could aid writing in this way: If I currently am researching a specific author across multiple of their books, I often will get a quote of theirs trapped in my head some length of time after reading it. If I have neglected to jot down (or even to highlight) the source of the quote, I could ask the model where the remembered passage came from and get back a higher-quality response.

jncfhnb
0 replies
2h11m

Eh, but fine tuning is a very awkward tool to solve those knowledge problems imo.

Author style, maybe, I guess.

llmzero
7 replies
7h11m

I liked that you link to renting dual 24GB GPUs for ~$0.60/hour, but how long would it take to fine-tune a 70B model using your system (4 bits for weights)?

If I were a consumer I would be interested in the final price of fine tuning, for example a table with model size, training size, cost of training, and expected loss of quality with this technology.

One obvious question: can you apply your technique to the recent (-1, 0, 1) encoding? I think you will answer that the (-1, 0, 1) model is not available and you can't try it, but my question is whether, once/if that model is available, answer.ai will be able to use the same technique as in this post to fine-tune a big model on two very small GPUs, and then I should ask for a new table with a cost/benefit analysis.

Edited: I should add that I find this kind of work very useful for enabling individual users like me to compete in the LLM applications market. This is great work, and along the lines of the book "Zero to One" (not that I like or dislike the author): solving the kind of problem that nobody else is trying to solve.

Edited: Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to look for votes and try to get some work done, and maybe some day create a new presence on HN again.

danielhanchen
3 replies
5h44m

On how long: fine-tuning time is influenced by your dataset size (more = slower), sequence length (since attention is O(N^2)), data movement, etc., and most importantly by how many steps you want to take. For QLoRA, some runs only need a few hundred steps, which can complete in minutes to an hour. Too many can overfit. So being able to fit it on consumer GPUs can be very cost-effective.

On the 1.58bit paper, from what I understand, this requires a total retraining from scratch. Hopefully the researchers will open source their weights :)

On the technicals, weights are encoded in (-1, 0, 1), whilst QLoRA uses a 4bit dynamic mapping of 16 numbers. The only change required would be the torch.matmul(X, W) step, where it'll be torch.bitlinear_matmul(X, W). Before with QLoRA, one has to do torch.matmul(X, dequantize(W)). So one has to implement torch.bitlinear_matmul. The backward is torch.bitlinear_matmul(dY, W.T).
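
To make the contrast concrete, here is a rough sketch of the two forward paths in PyTorch-flavoured pseudocode; `dequantize` and the LoRA matrices are placeholders (as in the comment above), not functions that actually exist under those names in torch:

    # Sketch of the forward-pass difference described above. `dequantize` is a
    # placeholder, not a real torch function; LoRA scaling factors are omitted.
    import torch

    def qlora_linear(X, W_4bit, lora_A, lora_B, dequantize):
        # QLoRA: dequantize the frozen 4-bit base weight to 16-bit, do a normal
        # matmul, then add the trainable low-rank LoRA update.
        W = dequantize(W_4bit)                  # 4-bit codes -> bf16/fp16
        return X @ W + (X @ lora_A) @ lora_B

    def ternary_linear(X, W_ternary):
        # 1.58-bit style: weights are only -1, 0, or 1, so the "matmul" reduces
        # to sign flips and additions (written as a plain matmul here for clarity).
        assert torch.all((W_ternary == -1) | (W_ternary == 0) | (W_ternary == 1))
        return X @ W_ternary.to(X.dtype)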

miohtama
2 replies
1h17m

What's the magic in 1.58bit vs. 4 bit that it makes it so much more efficient (claimed)?

nyrikki
0 replies
54m

The really simple explanation is that, for inference, feed-forward networks are threshold circuits: by their nature the units produce a binary output, true or false (the same as a threshold circuit).

So if you train your models with that in mind, your weights can be reduced to -1, 0, 1, reducing the space complexity.

I don't think the costs in expressiveness are fully captured yet, but since perplexity doesn't care about correctness, if that is the metric that matters to you, this will probably reduce memory requirements for inference.

danielhanchen
0 replies
1h3m

From what I understand, using (-1, 0, 1) removes multiplications on GPUs. I.e., assume you have a weight matrix and multiply it by some activations:

    [10, 20, 30] x [-1,  0,  1]
                   [ 0,  1, -1]
                   [ 1,  1,  0]

Instead of doing 10(-1) + 20(0) + 30(1) + 10(0) + ..., since we know beforehand the weights are simply (-1, 0, 1), we can just flip the sign and add, or force the hardware to do addition: if the weight is (-1), subtract; if (0), add nothing; if (1), add.

Floating point multiplication does addition of the exponents and multiplying of the mantissa. So just simplifying:

Float16 has E=5, M=10. Ie around 5 + 10^2 space needed = 105.

Bfloat16 has E=8, M=7. So 8 + 7^2 = 57 space.

Float8(143) E=4, M=3. So 4 + 3^2 = 13 space.

1.58(16bit) E=5, M=10. Addition only, so shift E say 5 + 10 addition = 15.

1.58(8bit) E=4, M=3. Addition only, so shift E say 4 + 3 addition = 7.

Obviously I'm simplifying, but with only additions, 1.58 uses say 7 space, whilst FP8 uses 13 space, so in theory 2x more transistors can be crammed, ie 2x more FLOPs than FP8.
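
A tiny numeric check of the trick described above: with weights restricted to -1/0/1, the dot product is just "subtract, skip, or add" each activation.

    # Weights of -1/0/1 turn a dot product into additions and subtractions.
    import torch

    x = torch.tensor([10.0, 20.0, 30.0])
    w = torch.tensor([-1.0, 0.0, 1.0])    # one column of a ternary weight matrix

    via_matmul = x @ w                    # uses multiplies: 10*(-1) + 20*0 + 30*1
    via_add_sub = -x[0] + x[2]            # -10 + 30: no multiplies at all
    assert via_matmul.item() == via_add_sub.item()   # both are 20.0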

swader999
0 replies
4h19m

I like how you think about social media.

jph00
0 replies
5h56m

As mentioned in the post, benchmarking results are coming in a later post. But in short: you can train an epoch of Alpaca in 24 hours or so, which is enough to get very significant change in model behavior.

airstrike
0 replies
36m

> Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to look for votes and try to get some work done, and maybe some day create a new presence on HN again.

If you use Stylus (or any similar browser extension), I actually wrote a style to hide points for that very reason, replacing karma and scores with `•••`

This is actually the second time I've seen someone mention this need, so I've made it into a gist and published it to userstyles, but here it is also, since it's pretty short:

    @-moz-document domain("news.ycombinator.com") {
        /* Hide karma and points on replies */
        span.pagetop #karma, span.comhead span.score {
            visibility: hidden;
            position: relative;
            display: inline-block;
            height: 10px !important;
            overflow: hidden;
        }
        span.pagetop #karma {
            width: 0.8rem !important;
        }
        span.comhead span.score {
            width: 0.8rem !important;
        }
        span.pagetop #karma::before, span.comhead span.score::before {
            content: "•••";
            visibility: visible;
            overflow: hidden;
            opacity: 0.8;
            font-family: Helvetica, Arial, sans-serif !important;
        }
    }

https://gist.github.com/airstrike/62584e6ffb6104791c0ae48a8e...

https://userstyles.world/style/15164/hackernews-hide-karma-a...

itsgrimetime
4 replies
11h15m

Would be cool to build an “LLM@home” project like folding@home or SETI@home (rip), where tons of folks could donate their GPUs and train something huge and FOSS. I don’t know enough about how these models are trained though. Could it be chunked up and distributed in that way, then stitched/merged back together?

miohtama
0 replies
1h12m

Golem has been building this since 2017

https://www.golem.network/

They also have an option to get paid in crypto for your GPU power.

The challenge is that the AI software architectures are not made to run over the Internet.

humansareok1
0 replies
2h25m

Always figured it would be too slow. Distributed training on clusters is usually done with 1+ gb/s interconnects.

eurekin
4 replies
8h16m

This might be the most interesting constructive approach in "Open Source" LLMs I've seen. Grounded, reasonable and inviting to replicate! I wish academia took that as a standard.

Great job!

carlossouza
2 replies
7h38m

Answer.ai is truly open AI. :)

rvz
1 replies
4h44m

That's what was said about OpenAI, Mistral, before the VCs and investors came in.

After that, the larger flagship AI models were closed up again and used as a server-only offering.

ericd
0 replies
4h1m

I doubt it. Jeremy's been walking the walk for quite a while now when it comes to opening up access to AI, especially with his excellent, free fast.ai course; it seems pretty clear that his primary motivation is helping others. (If you're in this thread, Jeremy, thanks for fast.ai; it helped me immensely in getting started with training models.)

20wenty
0 replies
4h23m

For the most part this post was easy to read, and I could feel the collective excitement of the team. I came away feeling like I'd learned something and was ready to try it myself. The only place the post gets a little fuzzy is "...store the quantized parameters in a selectable data type, where that storage data type is the same data type as the "computation type" of the model". I assume "selectable data type" means the float size of the quantization?
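
My guess (which could be wrong) is that "storage data type" just means the dtype the packed 4-bit codes are held in, independent of their contents. A plain-PyTorch illustration of the idea, not how bitsandbytes actually packs its tensors:

    # The packed 4-bit codes are raw bytes, so they can be *viewed* as bf16
    # (matching the compute dtype) without changing a single bit. Illustration
    # only; not the actual bitsandbytes packing code.
    import torch

    codes = torch.randint(0, 16, (1024,), dtype=torch.uint8)   # fake 4-bit codes
    packed = (codes[0::2] << 4) | codes[1::2]                   # two codes per byte

    as_bf16 = packed.view(torch.bfloat16)    # same bytes, "stored" as bf16
    roundtrip = as_bf16.view(torch.uint8)    # view back: bytes unchanged

    assert torch.equal(packed, roundtrip)
    print(packed.dtype, as_bf16.dtype)       # torch.uint8 torch.bfloat16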

chasd00
4 replies
4h51m

What’s the best way for people to contribute to AI open source? I can’t produce things like this for many reasons so how can I and others like me do our part to keep SOTA AI open?

sophrocyne
1 replies
4h37m

There is a ton you can do to help SOTA AI remain open.

Join the community building the tools - Help with UI/UX, documentation, keeping up with the latest, and evangelizing whatever method the team building it has devised to keep it sustained.

Being part of the community itself is more valuable than you realize.

SamPatt
0 replies
40m

Where are you finding this community?

hamilyon2
0 replies
28m

I am a random software engineer, but from what I've learned, high-quality open-source datasets seem to be an enabler. There is a shortage of golden datasets for training and evaluation in every popular and niche area you can imagine.

ativzzz
0 replies
3h16m

Off the top of my head

- try to implement techniques that are doable on home hardware like the one described in OP (requires some $$$ investment) and give feedback or contribute to documentation / guides

- learn about different techniques and do educational writeups or documentation (like https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/)

- build a tool / library that wraps academic techniques and exposes them more easily to end users (like A1111 or comfyUI for stable diffusion)

Anything that can translate the high end research down to something a moderately technical user can use or learn from is a gigantic win

yalok
3 replies
11h29m

Have you guys looked at using sparsification? It would probably require true re-training of the foundation model to reach high sparsity ratios (say 90% of weights excluded), which could be done once on expensive GPUs, but fine-tuning such sparse models would hopefully require less RAM.

The trick to getting more benefit from the sparse approach is to do block sparsity (IIRC, Tim Dettmers used to work on this as well, a few years ago), but a large block size (say 16x16) would require much longer retraining to recover the lost accuracy…

AhtiK
1 replies
10h7m

Has anyone seen an implementation of 'SpQR: A Sparse-Quantized Representation,' published in June 2023 by Tim Dettmers et al.? https://arxiv.org/abs/2306.03078

jph00
0 replies
10h42m

Yes, sparsification is another useful approach for higher efficiency, although block sparse kernels are pretty complex to work with -- especially when combined with quantization and LoRA! Most of the sparsity papers I've seen use "structured" sparsity; i.e removing layers, attention heads, and features. But the upside from this seems somewhat limited so far.

jph00
2 replies
12h28m

One thing I forgot to mention in the post which I think is kinda cool: at the NeurIPS Efficiency Challenge this year, where Tim Dettmers and I both did keynotes, every single top-ranked entry used QLoRA! The challenge was to create the most accurate model on a single GPU in 24 hours.

I think that is a great example of how important and useful QLoRA is. Maybe we should run a dual-GPU challenge next time, now that multi-GPU is working...

3abiton
1 replies
7h37m

Are any of the NIPS resources available online?

curl-up
2 replies
2h12m

Does anyone have sources, or experience, with fine-tuning primarily to teach the model some factual data, especially when it comes to later "higher level" question answering?

For example, giving the model a bunch of text (academic papers and such) about 19th century writers, then asking things like "Who were the main influences on writer X"?

Obviously simple RAG-like approaches don't work, as such information is rarely available in the text as-is, and needs to be "extrapolated" to some extent. Long context models might work (just dumping everything into the prompt), but are way too expensive for my needs.

armcat
1 replies
1h52m

RAG approaches should work quite well for the examples you mentioned. It's a matter of how you approach the retrieval part - you can opt for a larger recall on retrieval, and leverage the large context window for the LLM to figure out the answer. Even if it's not "as-is", semantically if it's in there, it should be able to find it.

Other things to try out is how you approach the question "expansion" part, for example using Hypothetical Document Embeddings (HyDE); or how you approach the filtering-out part, e.g. using "System 2 Attention", https://arxiv.org/abs/2311.11829.
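
In case it helps, here is a minimal sketch of the HyDE idea: embed a hypothetical answer rather than the raw question, then retrieve against that embedding. `generate`, `embed`, and `vector_index.search` are placeholders for whatever LLM, embedding model, and vector store you already use:

    # HyDE-style retrieval sketch; generate/embed/vector_index are placeholders.
    def hyde_retrieve(question, generate, embed, vector_index, k=10):
        # 1. Have the LLM draft a plausible (possibly wrong) passage that would
        #    answer the question; it only needs to be semantically close to
        #    real answer passages, not factually correct.
        hypothetical = generate(
            f"Write a short encyclopedia-style paragraph answering: {question}"
        )
        # 2. Embed the hypothetical passage instead of the bare question.
        query_vector = embed(hypothetical)
        # 3. Retrieve the k nearest real documents; the LLM then answers from
        #    those, as in any other RAG pipeline.
        return vector_index.search(query_vector, k=k)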

curl-up
0 replies
1h44m

I tried most of such techniques, but the point is that this information really isn't in there directly, and to perform the question expansion, the model needs to know about the domain already.

For example, imagine that one paper is about how author X was French, writing in the early 19th century, and how they were one of the first to write about topic T. Another paper is about how author Y was inspired by the early 19th-century French writers writing about T. However, this second article does not mention X at all. Asking "who were the main influences on X" would not surface the second article.

Of course, I could run "multiple-hop" RAG-like process, where the model keeps asking questions itself and so on in a loop, but this becomes extremely clumsy, and the models (even GPT-4) tend to get out of hand. It is also extremely slow, of course.

zerop
1 replies
5h20m

Question: can I use this to retrain an LLM's (70B) weights on my own data? I am using RAG as of now for asking questions about my text, but I always wonder if I could retrain an LLM on my own text. Thoughts?

AnthusAI
0 replies
4h43m

Fine tuning is generally not the best way to teach an LLM new knowledge. RAG is still more appropriate. Fine tuning is generally more effective for controlling the format of the responses but it's not going to teach the model a lot of new concepts. The model can learn how to handle new vocabulary through fine tuning but it's not a great way to teach the model new facts. Giving it access to a knowledge base is a better way to do that.

ricopags
1 replies
19h33m

This is such exciting news! Huge thanks to you for your continued work in making sense of AI.

I wonder if the recent Bitnet 1.58 paper [the use of ternary bits in lieu of fp/int] might be an advancement that could further reduce the computation required for inference?

jph00
0 replies
12h38m

Yes, along with the many other <4 bit quant methods recently developed -- there's been a wonderful boom in low-bit quant methods in the last 6 months, and we've got our own ideas for taking them further too. Along with QLoRA/FSDP, we're likely to see big advances in model training this year on consumer hardware.

pella
1 replies
12h22m

> the ability to use multiple GPUs with QLoRA training.

Thorough article!

Question: What's your opinion on:

- How viable will NVIDIA's consumer cards be in the long run?

- Besides https://tinygrad.org, what other cost-effective future alternatives could there be?

bugglebeetle
0 replies
11h1m

Unsloth (mentioned in the Answer.AI post) is planning multi-GPU support in a future release.

openquery
1 replies
3h0m

This article is very well written and super informative. One thing I didn't understand is:

> At Answer.AI our north star is making useful AI more accessible. $150,000 to create your own high-quality personalized model definitely doesn’t count as accessible!

Renting an A100 on RunPod is ~$1.89 / hour. So you'd need ~80,000 A100 hours to train a useful AI model?

humansareok1
0 replies
2h35m

In the post it explicitly says you can train on two 3090-level cards, which are significantly cheaper, and the headline literally says "Finetune", not "Pretrain".

m3kw9
1 replies
11h56m

If they can continuously train it, it could be better than a large context, as this is how an AI OS would need to work when you have constant updates to your files.

padolsey
0 replies
11h9m

I don't think you'd be fine-tuning a whole model in such cases. That seems over the top, no? I assume you'd get sufficiently far with big context windows, vector search, RAG, etc.

keeptrying
1 replies
11h47m

If you are gonna be doing stuff like this I’m damn excited for answer.ai!

It’ll be the first time we’ll have someone who knows AI create leverage to open source it.

Way to go!

chasd00
0 replies
5h45m

> It’ll be the first time we’ll have someone who knows AI create leverage to open source it.

It can’t be overstated how important this is. Thank you again.

jamesblonde
1 replies
11h24m

This is a fantastic breakthrough for those of us who fine-tune LLMs on limited hardware budgets.

I was curious about the choice of FSDP over DeepSpeed. I have been using Axolotl for fine-tuning, and FSDP has been broken there, whilst DeepSpeed is rock solid. Why FSDP over DeepSpeed jph00?

jph00
0 replies
10h46m

DeepSpeed has more features than FSDP, but it's much more complex to hack on -- FSDP is written directly in python using calls to the PyTorch library, whereas DeepSpeed is 20% C++ and 10% CUDA (according to the GitHub stats).

We've found that FSDP works just as well for our needs, and we appreciated the increased "hackability".

(Axolotl is terrific BTW. I hadn't heard of problems with it with FSDP before -- I'll see if that's something we can help with.)

delegate
1 replies
9h11m

Maybe I've missed it in the article - but how long would a full training run take on 2 consumer GPUs (local or rented) ? Ballpark - hours, days... ?

gardnr
0 replies
7h38m

The author is discussing fine-tuning a base model. How long it takes really depends on the dataset, the method, and the hyperparameters. DPO, for example, can achieve some great results with a fraction of the steps of other methods.

Just like with unsloth or axolotl, the people that use this will have to make compromises that give results in a reasonable amount of time.

carbocation
1 replies
12h13m

I wonder whether LoRAs could be useful for U-Net training. Especially thinking of CNN-based U-Net models with pre-trained encoders (but randomly initialized decoders). At least, it seems possible that normal weight updates on the decoder and LoRA training on the encoder could improve efficiency.

jph00
0 replies
10h51m

Diffusion unet has an "extended" version nowadays that applies to the resnet part as well as the cross-attention: https://github.com/cloneofsimo/lora

Kelteseth
1 replies
6h52m

Any plans on supporting AMD? In Germany, the price of an 7900XTX is HALF of a NV 4090...

tbenst
0 replies
10h26m

Very interesting but hard to interpret until the performance numbers / benchmarks are available. I can already fine-tune a 70B language model at home using CPU + RAM, but it would be so slow as to be almost totally impractical (~20x slower than GPU). It would be great to see a comparison to eg 8 x A100 (available for $32/hr on AWS on-demand) and also CPU + RAM. Presumably it’s somewhere in between, but hard to predict where!

sieszpak
0 replies
4h23m

4x 3080???

pama
0 replies
5h37m

Thank you for the repo and write up. What tools (if any) did you use for performance tuning once you achieved the main goal of being able to finetune the model?

lbj
0 replies
10h47m

Can't believe they didn't name this Qolor

jl6
0 replies
5h8m

Besides being a great result, the quality and clarity of the technical writing here is excellent.

jiwidi
0 replies
2h0m

> home

> two 24GB GPUs.

geez

iandanforth
0 replies
4h57m

This is great, however there were many opportunities to use the word 'nibble' in this post and they were all missed.

hathym
0 replies
7h58m

Imagine the potential of a Folding@Home-inspired project for AI development. What kind of powerful model could a community of gamers and GPU owners create?

g42gregory
0 replies
11h42m

This is brilliant. Thank you for doing this!

ericd
0 replies
1h41m

This is the best news I’ve seen all month. I think one of the great near-term dangers of AI is the bulk of the economic benefit going mainly to relatively few companies. That risk seems substantially reduced if they have to compete with a great variety of models.

chompychop
0 replies
6h22m

Does this support multimodal language models (E.g.: LLaVA)?

buildbot
0 replies
12h46m

Nice, I tried to use QLoRA+FSDP in the past with litgpt and obviously at that time it did not work. This is very useful!

Tostino
0 replies
1h26m

Nice, I've been hoping this would be possible for a while. I'll have to do a new fine-tune of Inkbot on top of one of the 70b models.

What are the max context lengths / batch sizes you can train at with this method for 2x24gb? What about 4x24gb?

Nouser76
0 replies
1h58m

Is there any framework/system that distributes the work across multiple GPUs on different computers over a network (LAN or WAN)? I'm not concerned much about latency or generation time, but would love to train or load up huge models and send jobs to run overnight.