
LoRA from scratch: implementation for LLM finetuning

rsweeney21
24 replies
1d23h

It's still strange to me to work in a field of computer science where we say things like "we're not exactly sure how these numbers (hyperparameters) affect the result, so just try a bunch of different values and see which one works best."

r3trohack3r
8 replies
1d22h

I feel like it's the difference between something that has been engineered and something that has been discovered.

I feel like most of our industry up until now has been engineered.

LLMs were discovered.

arketyp
2 replies
1d22h

I understand your distinction, I think, but I would say it is more engineering than ever. It's like the early days of the steam engine or firearms development. It's not a hard science, not formal analysis, it's engineering: tinkering, testing, experimenting, iterating.

peddling-brink
0 replies
1d21h

"tinkering, testing, experimenting, iterating"

But that describes science. http://imgur.com/1h3K2TT/

amelius
0 replies
1d21h

AI requires a lot of engineering. However, the engineering is not what makes working in AI interesting. It's the plumbing, basically.

herval
1 replies
1d14h

LLMs were very much engineered... the exact results they yield are hard to determine since they're large statistical models, but I don't think that categorizes the LLMs themselves as a 'discovery' (like, say, penicillin)

baq
0 replies
1d10h

There’s an argument that all maths are discovered instead of invented or engineered. LLM hardware certainly is hard engineering but the numbers you put in it aren’t, once you have them; if you stumbled upon them by chance or they were revealed to you in your sleep it’d work just as well. (‘ollama run mixtral’ is good enough for a dream to me!)

mejutoco
0 replies
1d10h

I believe, from what I saw in mathematics, this is a matter of taste. Discovered and invented are two perspectives. Some people prefer to think of light reaching into previously dark corners of knowledge that was waiting to be found (discovery). Others prefer to think that by force of genius they brought the thing into the world (invention).

To me, personally, these are two sides of the same coin, with neither having more proof than the other.

justanotheratom
0 replies
1d22h

and finally, this justifies the "science" in Computer Science.

SkyMarshal
0 replies
1d21h

If the Black Swan model of science is true, then most of the consequential innovations and advances are discovered rather than engineered.

CamperBob2
2 replies
1d21h

This can be laid at the feet of Minsky and others who dismissed perceptrons because single-layer perceptrons couldn't model linearly inseparable functions like XOR. LLMs were never going to happen until modern CPUs and GPUs came along, but that doesn't mean we couldn't have had a better theoretical foundation in place. We are years behind where we should be.

When I worked in the games industry in the 1990s, it was "common knowledge" that neural nets were a dead end at best and a con job at worst. Really a shame to lose so much time because a few senior authority figures warned everyone off. We need to make sure that doesn't happen this time.

spidersenses
1 replies
1d20h

What is the point you're trying to make?

CamperBob2
0 replies
1d20h

"What is the point you're trying to make?"

Answering the GP's point regarding why deep learning textbooks, articles, and blog posts are full of sentences that begin with "We think..." and "We're not sure, but..." and "It appears that..."

What's yours?

thatguysaguy
0 replies
1d20h

I haven't seen this key buzzword mentioned yet, so: I think part of it is the fact that we're now working on complex systems. This was already true (a social network is a complex system), but now we have the impenetrability of a complex system within the scope of a single process. It's hard to figure out generalizable principles about this kind of thing!

stormfather
0 replies
1d21h

It's how God programs

raxxorraxor
0 replies
1d8h

Welcome to engineering. We don't sketch our controlled systems, and we forget all about systems theory. Instead we just fiddle with our controllers until the result is acceptable.

manojlds
0 replies
1d23h

Divine benevolence

jncfhnb
0 replies
1d5h

Not strange at all. This is largely how biology operates. These things are simpler than bio and more complex than programs

jejeyyy77
0 replies
1d22h

it's a new paradigm

fierro
0 replies
1d19h

we have no theories of intelligence. We're like people in the 1500s trying to figure out why and how people get sick, with no concept of bacteria, germs, transmission, etc

amelius
0 replies
1d21h

AI is more like gardening than engineering. You try things without knowing the outcome. And you wait a very long time to see the outcome.

UberFly
0 replies
1d21h

This is what researching different Stable Diffusion settings is like. You quickly learn that there's a lot of guessing going on.

TacticalCoder
0 replies
1d21h

"we're not exactly sure how these numbers (hyper parameters) affect the result, so just try a bunch of different values and see which one works best."

Isn't it the same for anything that uses a Monte Carlo simulation to find a value? At times you'll end up on a local maximum (instead of the best/correct answer), but it works.

We cannot solve something using a closed-form formula, so we just do a billion (or whatever) random samplings and find what we're after.

I'm not saying it's the same for LLMs, but "trying a bunch of different values and seeing which one works best" is something we do a lot.
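For illustration (everything here is hypothetical, including the stand-in objective), a minimal random search over two hyperparameters might look like:

    import random

    def train_and_evaluate(learning_rate, rank):
        # stand-in for a real finetuning run; replace with training + validation
        return random.random()

    best_score, best_config = float("-inf"), None
    for _ in range(20):  # sample 20 random configurations
        config = {
            "learning_rate": 10 ** random.uniform(-5, -2),  # log-uniform draw
            "rank": random.choice([4, 8, 16, 32]),
        }
        score = train_and_evaluate(**config)
        if score > best_score:
            best_score, best_config = score, config
    print(best_config, best_score)

As in the Monte Carlo case, there's no guarantee the winner isn't a local maximum; you just keep the best sample you've seen.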

SkyMarshal
0 replies
1d21h

That bottom-up tinkering is kinda how CS started in the US, as observed by Dijkstra himself: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...

Ideally we want theoretical foundations, but sometimes random explorations are necessary to tease out enough data to construct or validate theory.

FuckButtons
0 replies
1d14h

I mean, it’s kind of in the name isn’t it? Computer science. Science is empirical, often poorly understood and even the best theories don’t fully explain all observations, especially when a field gets new tools to observe phenomena. It takes a while for a good theory to come along and make sense of everything in science and that seems like more or less exactly where we are today.

denysvitali
12 replies
1d22h

LoRA != LoRa. I keep on getting confused and hate that they chose to reuse an existing acronym

sbrother
3 replies
1d21h

Wait, what is the meaning other than "Low-Rank Adaptation"? It's hard to google the difference.

jcuenod
0 replies
1d4h

Try asking an LLM :)

cristoperb
0 replies
1d21h

It's the name of a "Lo"ng "Ra"nge wifi-like technology:

https://en.wikipedia.org/wiki/LoRa

boolemancer
0 replies
1d21h

I assume the radio technology:

https://en.wikipedia.org/wiki/LoRa

sschueller
2 replies
1d22h

It's unfortunate that these two so-far-unrelated technologies share the same acronym.

girvo
0 replies
1d16h

LoRa the radio tech came first, so as far as I'm concerned it's the canonical definition of the acronym. But I'm biased: I'm an embedded firmware dev.

dvngnt_
0 replies
12h51m

Probably better than the names being similar but not identical, since context still helps disambiguate.

esafak
1 replies
1d17h

That's what happens when people specialize and don't pay attention to what's going on outside their bubble.

HKH2
0 replies
1d16h

A quick web search could fix that.

daemonologist
1 replies
1d22h

Likewise. My day job is machine learning and I still, or maybe consequently, do a double-take every time I see the acronym with minimal context (like on the HN front page, where either usage would be normal).

travisgriggs
0 replies
1d17h

And my day job involves a lot of LoRa. I always do a double take on these. I'm grateful that at least the capitalization is now done differently.

blopp99
0 replies
1d17h

I hate the trend of software guys naming things after hardware-related stuff

dymk
5 replies
2d

Not to be confused with LoRa ("long range"), a radio communication protocol. At first I thought this could be about using LLMs to find optimal protocol parameters, but alas.

thelastparadise
0 replies
2d

This caught me off-guard as well.

I really wish they could have used another acronym.

the__alchemist
0 replies
1d23h

Concur; or at least don't use a mix of lower- and upper-case like the radio tech. I think there would be fewer mistaken assumptions if they had called it "LORA", "Lora", "lora", etc. "LoRA" is asking for trouble.

rasbt
0 replies
2d

Hah, yeah that's LoRA as in Low-Rank Adaptation :P

cpfohl
0 replies
2d

I had the exact same confusion

OJFord
0 replies
2d

It's the first thing that comes to my mind too, but this is mentioned in every thread (and there are far more of them for LoRA than LoRa atm), and in this case there's unlikely to be much confusion because it starts by spelling out the acronym: 'LoRA, which stands for Low Rank Adaptation, [...]'.

chenxi9649
5 replies
1d23h

It's still not clear to me when we should fine-tune versus use RAG.

In the past, I believed that finetuning was mostly for changing model behavior, but recently it seems that certain companies are also using fine-tuning for knowledge addition.

What are the main use cases for fine-tuning?

pizza
1 replies
1d14h

These are autoregressive models. When you have a new type of sequence, where future elements can be predicted from previous parts of the sequence but in a different way than the models have seen before, it would make sense to finetune.

Admittedly, that's a pretty vague descriptor for how to decide what to do for a given data scenario, but it might be good enough as a rough heuristic. Now, whether knowledge addition falls under that might be a question of taste (without experiments).

jcuenod
0 replies
1d4h

Exactly this. If you have a model that's never seen JSON and you want JSON to come out, fine-tuning is probably not a bad idea. If you have a model trained on English documents and you want it to produce English documents related to your company, you don't need to fine-tune.

rasbt
0 replies
1d23h

I think the main use case remains behavior changes: instruction finetuning, finetuning for classification, etc. Knowledge addition to the weights is best done via pretraining. Or, if you have an external database or documentation that you want to query during generation, there's RAG, as you mention.

PS: All winners of the NeurIPS 2023 LLM Efficiency Challenge (finetuning the "best" LLM in 24h on 1 GPU) used LoRA or QLoRA (quantized LoRA).

ignoramous
0 replies
1d22h

From what I gather, fine-tuning is unreasonably effective [0] because in-context learning really depends on how powerful the underlying model is and just how you do RAG (process queries, retrieve embeddings, rank outcomes, etc [1]). Per this paper I read, fine-tuning may add new domain knowledge (but as another commenter pointed out, knowledge is better represented from data of the pre-training stage) or boost specific knowledge; while RAG is limited to boosting only; nevertheless, both techniques turn out to be similarly capable with different trade-offs [2].

--

[0] Fast.ai: Can Models learn from one sample, https://www.fast.ai/posts/2023-09-04-learning-jumps/ / https://archive.is/eJMPR

[1] LlamaIndex: Advanced RAG, https://blog.llamaindex.ai/a-cheat-sheet-and-some-recipes-fo... / https://archive.is/qtBXX

[2] Microsoft: RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study, https://arxiv.org/html/2401.08406v2#S6 / https://archive.is/UQ8Sa#S6

CuriouslyC
0 replies
1d22h

Fine tuning is better than RAG when the additional data isn't concise, or requires context. This is because too much context (or "unfocused" context) can dilute prompt following behavior, and RAG doesn't help the model with higher order token associations so you have to get lucky and pull what you need from the augmentation material, at which point it's not much better than a fancy search engine. Of course this is mostly an issue when you're dealing with a specialized corpus with its own micro-dialect that isn't well represented in public data sets, such as with government/big corporation internal documents.

somethingsome
4 replies
1d20h

Nice article. I'm not in this field; however, my understanding of the original paper was that LoRA was applied only to the last dense layer, not to all layers independently (maybe I misread it originally).

Digging a bit into why the implementation in the link is like this, I found that QLoRA used this approach and it seems to have some interesting effects; maybe adding a note on the QLoRA decision would be nice :)

I'm not sure I understand why it works, though. My neophyte view was that applying LoRA to the last layer made sense, but I can't wrap my mind around the rationale of applying it repeatedly to each linear layer. Can someone explain their intuition?

icyfox
3 replies
1d20h

Like most things in ML, the answer of which layers to use comes down to empirical evidence more than theory. In a typical LoRA training pipeline, you freeze the contents of the base model and adjust only the LoRA layers. The more layers you convert to LoRA layers, the more degrees of freedom you have for the optimization (see the sketch below).

There are some finetuning regimens that recommend finetuning only the last layer, since it is theorized to hold the "highest-order" representation of the inputs. Other training regimens finetune all layers. It's largely data- and problem-dependent; LoRA just mirrors this convention.
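As a rough sketch of that pipeline (assuming a LinearWithLoRA wrapper like the one sketched further down the thread; the helper name here is hypothetical):

    import torch.nn as nn

    def add_lora_to_linears(model, rank=8, alpha=16):
        # freeze every parameter of the (sub)module first
        for param in model.parameters():
            param.requires_grad = False
        # then swap each nn.Linear for a LoRA-wrapped version; only the new
        # LoRA matrices created inside LinearWithLoRA remain trainable
        for name, child in model.named_children():
            if isinstance(child, nn.Linear):
                setattr(model, name, LinearWithLoRA(child, rank=rank, alpha=alpha))
            else:
                add_lora_to_linears(child, rank=rank, alpha=alpha)

Converting fewer layers is then just a matter of filtering which modules get wrapped.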

somethingsome
2 replies
1d18h

Yeah, but if I remember the paper correctly, LoRA followed the logic that only the last layers of an LLM change drastically during finetuning, while the earlier layers remain almost unchanged, so it made sense to alter only the last ones. Breaking this by adding a LoRA at each linear layer doesn't seem to follow the logic of why LoRA was created and why it works.

icyfox
1 replies
1d17h

Well, LoRA works just because it's a low-rank approximation of the full updates, much in the same way that SVD works and regular gradient updating works. It delivers good results both by acting as a regularizer and by allowing larger models to be updated with smaller memory footprints.

My point is that the original LoRA paper choosing the last layer is one choice. It is likely the most common one because that layer's higher symbolic nature is typically all that's needed for good performance on downstream tasks.

Depending on the size of your finetuning job, I've personally seen updating more layers (or updating some only on a certain learning-rate schedule) be more effective. LoRA is just the mathematical technique for the update; it doesn't really have a hypothesis on the ideal training regimen.
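To put a number on "low-rank approximation of the full updates": a full update to a d x k weight matrix trains d*k values, while LoRA trains only the factors of delta-W ~ B A. A hypothetical back-of-the-envelope example:

    # full-rank update vs. LoRA factors for one weight matrix
    d, k, r = 4096, 4096, 8           # hypothetical layer size and rank
    full_update = d * k               # 16,777,216 trainable values
    lora_update = r * (d + k)         # 65,536 values in B (d x r) and A (r x k)
    print(full_update / lora_update)  # 256.0, i.e. ~256x fewer trainable parameters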

somethingsome
0 replies
1d13h

Thanks, I'll meditate on that and re-read the paper with this view in mind.

The last sentence makes sense to me: if the finetuning job significantly changes the weights of layers other than just the last one, it is kinda normal to use LoRA on them. I had the impression that this was rarely the case, but I must be mistaken. I'll think about applications where this is the case.

andy99
3 replies
2d

"From scratch" seems to be a matter of opinion. "Pure pytorch" maybe, except it uses HF transformers. So it's LoRA on top of common frameworks...

rasbt
0 replies
2d

Yeah, the LoRA part is from scratch. The LLM backbone in this example is not; it's there to provide a concrete example. But you could apply the exact same LoRA-from-scratch code to a pure PyTorch model if you wanted to:

E.g.

    class MultilayerPerceptron(nn.Module):

        def __init__(self, num_features, num_hidden_1, num_hidden_2, num_classes):
            super().__init__()

            self.layers = nn.Sequential(
                nn.Linear(num_features, num_hidden_1),
                nn.ReLU(),
                nn.Linear(num_hidden_1, num_hidden_2),
                nn.ReLU(),
                nn.Linear(num_hidden_2, num_classes)
            )

        def forward(self, x):
            x = self.layers(x)
            return x

    model = MultilayerPerceptron(
        num_features=num_features,
        num_hidden_1=num_hidden_1,
        num_hidden_2=num_hidden_2, 
        num_classes=num_classes
    )

    # wrap each nn.Linear (indices 0, 2, 4 in the Sequential) with LoRA
    model.layers[0] = LinearWithLoRA(model.layers[0], rank=4, alpha=1)
    model.layers[2] = LinearWithLoRA(model.layers[2], rank=4, alpha=1)
    model.layers[4] = LinearWithLoRA(model.layers[4], rank=4, alpha=1)
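For completeness: the snippet above assumes a LinearWithLoRA wrapper. A minimal sketch along the lines of the article (the initialization details here are an assumption, not necessarily the article's exact code):

    import torch
    import torch.nn as nn

    class LoRALayer(nn.Module):
        def __init__(self, in_dim, out_dim, rank, alpha):
            super().__init__()
            # A starts random, B starts at zero, so A @ B = 0 and training
            # begins from the unmodified pretrained weights
            std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
            self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
            self.B = nn.Parameter(torch.zeros(rank, out_dim))
            self.alpha = alpha

        def forward(self, x):
            return self.alpha * (x @ self.A @ self.B)  # scaled low-rank update

    class LinearWithLoRA(nn.Module):
        def __init__(self, linear, rank, alpha):
            super().__init__()
            self.linear = linear  # the (frozen) pretrained layer
            self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

        def forward(self, x):
            return self.linear(x) + self.lora(x)  # original output plus correction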

michaelnny
0 replies
1d11h

If anyone is interested in a more 'pure' or 'scratch' implementation, check out https://github.com/michaelnny/QLoRA-LLM. (author here) It also supports 4-bit quantized LoRA, using only PyTorch and bitsandbytes, without any other tools.

2024throwaway
0 replies
2d

This apple pie recipe claims to be from scratch, but they cooked it in an off-the-shelf oven. So it's from scratch on top of the universe...

jamesblonde
2 replies
1d22h

I prefer the not-from-scratch, configuration-driven approach of Axolotl. Axolotl supports fine-tuning Mistral, Llama-2, etc., with lots of the latest techniques: sample packing, flash attention, xformers.

I concentrate on collecting and curating the fine-tuning data and do "data-centric" fine-tuning, not learning LoRA from scratch.

wfalcon
1 replies
1d19h

This is also what our (Lightning AI) lit-gpt library does: https://github.com/Lightning-AI/lit-gpt

jamesblonde
0 replies
1d12h

Thanks, hadn't seen this.

mintrain
1 replies
1d19h

I've added an exercise to practice implementing the LoRA forward pass from scratch: https://tensorgym.com/exercises/17

The idea behind LoRA is beautiful, and the implementation is pretty straightforward.

Rudeg
0 replies
1d18h

nice, looks very cool and useful! I'll definitely try it!

ijhuygft776
1 replies
1d23h

I wish the wireless LoRa protocol were open source...

ignoramous
1 replies
2d

I've been keeping track of the techniques through Maxime Labonne's LLMs 101: https://github.com/mlabonne/llm-course#4-supervised-fine-tun...

pama
0 replies
1d23h

Thanks for the resource. It seems useful enough to warrant its own thread here.

helloericsf
1 replies
1d19h

HN friends, what are the most popular libraries for fine-tuning? (Not from scratch.)

fnordfnordfnord
1 replies
1d15h

I thought this was going to be some neat software defined radio stuff. Still quite interesting though.

z3ugma
0 replies
1d15h

It's all about whether the 'A' is capitalized or not: LoRa = radio, LoRA = machine learning.

facu17y
1 replies
1d22h

What's the performance penalty of LoRA?

rasbt
0 replies
1d21h

During training, it's more efficient than full finetuning because you only update a fraction of the parameters via backprop. During inference, it can ...

1) ... be theoretically a tad slower if you add the LoRA values dynamically during the forward pass (however, this is also an advantage if you want to keep a separate small weight set per customer, for example; you run only one large base model and can apply the different LoRA weights per customer on the fly)

2) ... have the exact same performance as the base model if you merge the LoRA weights back with the base model.
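As a sketch of option 2, assuming the LoRALayer convention sketched earlier in the thread (update = alpha * (x @ A @ B)); merge_lora is a hypothetical helper, not the article's API:

    import torch

    @torch.no_grad()
    def merge_lora(layer):
        # nn.Linear stores weight as (out_features, in_features), so the
        # equivalent weight delta is the transpose of A @ B, scaled by alpha
        delta_w = layer.lora.alpha * (layer.lora.A @ layer.lora.B).T
        layer.linear.weight += delta_w
        return layer.linear  # plain nn.Linear; inference cost matches the base model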

yandrypozo
0 replies
1d21h

Gotta say, naming is hard. I thought this was about LoRa (from "long range") or LoRaWAN, the IoT sensor communication protocol.

tussa
0 replies
1d11h

It's cheap and sleazy to steal a name from another project to ride its fame.

huqedato
0 replies
2d

Excellent and practical example! I'm curious if there's a comparable one using Julia or JavaScript.

gourabmi
0 replies
1d23h

Someone somewhere is already working on naming their project Lehsun.. /s

broabprobe
0 replies
1d23h

wow definitely thought this was about LoRa at first.