
LoRA from scratch: implementation for LLM finetuning

rsweeney21
24 replies
1d23h

It's still strange to me to work in a field of computer science where we say things like "we're not exactly sure how these numbers (hyperparameters) affect the result, so just try a bunch of different values and see which one works best."

r3trohack3r
8 replies
1d22h

I feel like it's the difference between something that has been engineered and something that has been discovered.

I feel like most of our industry up until now has been engineered.

LLMs were discovered.

arketyp
2 replies
1d22h

I understand your distinction, I think, but I would say it is more engineering than ever. It's like the early days of the steam engine or firearms development. It's not a hard science, not formal analysis, it's engineering: tinkering, testing, experimenting, iterating.

peddling-brink
0 replies
1d21h

"tinkering, testing, experimenting, iterating"

But that describes science. http://imgur.com/1h3K2TT/

amelius
0 replies
1d21h

AI requires a lot of engineering. However, the engineering is not what makes working in AI interesting. It's the plumbing, basically.

herval
1 replies
1d14h

LLMs were very much engineered... the exact results they yield are hard to determine since they're large statistical models, but I don't think that categorizes the LLMs themselves as a 'discovery' (like, say, penicillin)

baq
0 replies
1d10h

There’s an argument that all maths are discovered instead of invented or engineered. LLM hardware certainly is hard engineering but the numbers you put in it aren’t, once you have them; if you stumbled upon them by chance or they were revealed to you in your sleep it’d work just as well. (‘ollama run mixtral’ is good enough for a dream to me!)

mejutoco
0 replies
1d10h

I believe, from what I saw in mathematics, this is a matter of taste. Discovered and invented are two perspectives. Some people prefer to think of light reaching into previously dark corners of knowledge that was waiting to be found (discovery). Others prefer to think that by force of genius they brought the thing into the world (invention).

To me, personally, these are two sides of the same coin, with neither having more proof than the other.

justanotheratom
0 replies
1d22h

and finally, this justifies the "science" in Computer Science.

SkyMarshal
0 replies
1d21h

If the Black Swan model of science is true, then most of the consequential innovations and advances are discovered rather than engineered.

CamperBob2
2 replies
1d21h

This can be laid at the feet of Minsky and others who dismissed perceptrons because single-layer perceptrons couldn't model linearly inseparable functions like XOR. LLMs were never going to happen until modern CPUs and GPUs came along, but that doesn't mean we couldn't have had a better theoretical foundation in place. We are years behind where we should be.

When I worked in the games industry in the 1990s, it was "common knowledge" that neural nets were a dead end at best and a con job at worst. Really a shame to lose so much time because a few senior authority figures warned everyone off. We need to make sure that doesn't happen this time.

spidersenses
1 replies
1d20h

What is the point you're trying to make?

CamperBob2
0 replies
1d20h

"What is the point you're trying to make?"

Answering the GP's point regarding why deep learning textbooks, articles, and blog posts are full of sentences that begin with "We think..." and "We're not sure, but..." and "It appears that..."

What's yours?

thatguysaguy
0 replies
1d20h

I haven't seen this key buzzword mentioned yet, so: I think part of it is the fact that we're now working on complex systems. This was already true (a social network is a complex system), but now we have the impenetrability of a complex system within the scope of a single process. It's hard to figure out generalizable principles about this kind of thing!

stormfather
0 replies
1d21h

It's how God programs

raxxorraxor
0 replies
1d8h

Welcome to engineering. We don't sketch our controlled systems, and we forget all about systems theory. Instead we just fiddle with our controllers until the result is acceptable.

manojlds
0 replies
1d23h

Divine benevolence

jncfhnb
0 replies
1d5h

Not strange at all. This is largely how biology operates. These things are simpler than bio and more complex than programs

jejeyyy77
0 replies
1d22h

it's a new paradigm

fierro
0 replies
1d19h

we have no theories of intelligence. We're like people in the 1500s trying to figure out why and how people get sick, with no concept of bacteria, germs, transmission, etc

amelius
0 replies
1d21h

AI is more like gardening than engineering. You try things without knowing the outcome. And you wait a very long time to see the outcome.

UberFly
0 replies
1d21h

This is what researching different Stable Diffusion settings is like. You quickly learn that there's a lot of guessing going on.

TacticalCoder
0 replies
1d21h

"we're not exactly sure how these numbers (hyper parameters) affect the result, so just try a bunch of different values and see which one works best."

Isn't it the same for anything that uses a Monte Carlo simulation to find a value? At times you'll end up on a local maximum (instead of the best/correct answer), but it works.

We cannot solve something using a closed-form formula, so we just do a billion (or whatever) random samplings and find what we're after.

I'm not saying it's the same for LLMs, but "trying a bunch of different values and seeing which one works best" is something we do a lot.
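For illustration (everything here is hypothetical, including the stand-in objective), a minimal random search over two hyperparameters might look like:

    import random

    def train_and_evaluate(learning_rate, rank):
        # stand-in for a real finetuning run; replace with training + validation
        return random.random()

    best_score, best_config = float("-inf"), None
    for _ in range(20):  # sample 20 random configurations
        config = {
            "learning_rate": 10 ** random.uniform(-5, -2),  # log-uniform draw
            "rank": random.choice([4, 8, 16, 32]),
        }
        score = train_and_evaluate(**config)
        if score > best_score:
            best_score, best_config = score, config
    print(best_config, best_score)

As in the Monte Carlo case, there's no guarantee the winner isn't a local maximum; you just keep the best sample you've seen.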

SkyMarshal
0 replies
1d21h

That bottom-up tinkering is kinda how CS started in the US, as observed by Dijkstra himself: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...

Ideally we want theoretical foundations, but sometimes random explorations are necessary to tease out enough data to construct or validate theory.

FuckButtons
0 replies
1d14h

I mean, it’s kind of in the name isn’t it? Computer science. Science is empirical, often poorly understood and even the best theories don’t fully explain all observations, especially when a field gets new tools to observe phenomena. It takes a while for a good theory to come along and make sense of everything in science and that seems like more or less exactly where we are today.

denysvitali
12 replies
1d22h

LoRA != LoRa. I keep on getting confused and hate that they chose to reuse an existing acronym

sbrother
3 replies
1d21h

Wait, what is the meaning other than "Low-Rank Adaptation"? It's hard to google the difference.

jcuenod
0 replies
1d4h

Try asking an LLM :)

cristoperb
0 replies
1d21h

It's the name of a "Lo"ng "Ra"nge wifi-like technology:

https://en.wikipedia.org/wiki/LoRa

boolemancer
0 replies
1d21h

I assume the radio technology:

https://en.wikipedia.org/wiki/LoRa

sschueller
2 replies
1d22h

It's unfortunate that these two so-far-unrelated technologies share the same acronym.

girvo
0 replies
1d16h

LoRa the radio tech came first, so as far as I'm concerned it's the canonical definition of the acronym. But I'm biased: I'm an embedded firmware dev.

dvngnt_
0 replies
12h51m

Probably better than the names being similar but not identical, since context still helps disambiguate.

esafak
1 replies
1d17h

That's what happens when people specialize and don't pay attention to what's going on outside their bubble.

HKH2
0 replies
1d16h

A quick web search could fix that.

daemonologist
1 replies
1d22h

Likewise. My day job is machine learning and I still, or maybe consequently, do a double-take every time I see the acronym with minimal context (like on the HN front page, where either usage would be normal).

travisgriggs
0 replies
1d17h

And my day job involves a lot of LoRa. I always do a double take on these. I'm grateful that at least the capitalization is now done differently.

blopp99
0 replies
1d17h

I hate the trend of software guys naming things after hardware-related stuff

dymk
5 replies
2d

Not to be confused with LoRa ("long range"), a radio communication protocol. At first I thought this could be about using LLMs to find optimal protocol parameters, but alas.

thelastparadise
0 replies
2d

This caught me off-guard as well.

I really wish they could have used another acronym.

the__alchemist
0 replies
1d23h

Concur; or at least don't use a mix of lower- and upper-case like the radio tech. I think there would be fewer mistaken assumptions if they had called it "LORA", "Lora", "lora", etc. "LoRA" is asking for trouble.

rasbt
0 replies
2d

Hah, yeah that's LoRA as in Low-Rank Adaptation :P

cpfohl
0 replies
2d

I had the exact same confusion

OJFord
0 replies
2d

It's the first thing that comes to my mind too, but this is mentioned in every thread (and there are far more of them for LoRA than LoRa atm), and in this case there's unlikely to be much confusion because it starts by spelling out the acronym: 'LoRA, which stands for Low Rank Adaptation, [...]'.

chenxi9649
5 replies
1d23h

It's still not clear to me when we should fine-tune versus use RAG.

In the past, I believed that finetuning was mostly for changing model behavior, but recently it seems that certain companies are also using fine-tuning for knowledge addition.

What are the main use cases for fine-tuning?

pizza
1 replies
1d14h

These are autoregressive models. When you have a new type of sequence, where future elements can be predicted from previous parts of the sequence but in a different way than the models have seen before, it would make sense to finetune.

Admittedly, that's a pretty vague descriptor for how to decide what to do for a given data scenario, but it might be good enough as a rough heuristic. Now, whether knowledge addition falls under that might be a question of taste (without experiments).

jcuenod
0 replies
1d4h

Exactly this. If you have a model that's never seen JSON and you want JSON to come out, fine-tuning is probably not a bad idea. If you have a model trained on English documents and you want it to produce English documents related to your company, you don't need to fine-tune.

rasbt
0 replies
1d23h

I think the main use case remains behavior changes: instruction finetuning, finetuning for classification, etc. Knowledge addition to the weights is best done via pretraining. Or, if you have an external database or documentation that you want to query during generation, there's RAG, as you mention.

PS: All winners of the NeurIPS 2023 LLM Efficiency Challenge (finetuning the "best" LLM in 24h on 1 GPU) used LoRA or QLoRA (quantized LoRA).

ignoramous
0 replies
1d22h

From what I gather, fine-tuning is unreasonably effective [0] because in-context learning really depends on how powerful the underlying model is and just how you do RAG (process queries, retrieve embeddings, rank outcomes, etc [1]). Per this paper I read, fine-tuning may add new domain knowledge (but as another commenter pointed out, knowledge is better represented from data of the pre-training stage) or boost specific knowledge; while RAG is limited to boosting only; nevertheless, both techniques turn out to be similarly capable with different trade-offs [2].

--

[0] Fast.ai: Can Models learn from one sample, https://www.fast.ai/posts/2023-09-04-learning-jumps/ / https://archive.is/eJMPR

[1] LlamaIndex: Advanced RAG, https://blog.llamaindex.ai/a-cheat-sheet-and-some-recipes-fo... / https://archive.is/qtBXX

[2] Microsoft: RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study, https://arxiv.org/html/2401.08406v2#S6 / https://archive.is/UQ8Sa#S6

CuriouslyC
0 replies
1d22h

Fine tuning is better than RAG when the additional data isn't concise, or requires context. This is because too much context (or "unfocused" context) can dilute prompt following behavior, and RAG doesn't help the model with higher order token associations so you have to get lucky and pull what you need from the augmentation material, at which point it's not much better than a fancy search engine. Of course this is mostly an issue when you're dealing with a specialized corpus with its own micro-dialect that isn't well represented in public data sets, such as with government/big corporation internal documents.

somethingsome
4 replies
1d20h

Nice article. I'm not in this field; however, my understanding of the original paper was that LoRA was applied only to the last dense layer, not to all layers independently (maybe I misread it originally).

Digging a bit into why the implementation in the link is like this, I found that QLoRA used this approach and it seems to have some interesting effects; maybe adding a note on the QLoRA decision would be nice :)

I'm not sure I understand why it works, though. My neophyte view was that applying LoRA to the last layer made sense, but I can't wrap my mind around the rationale of applying it repeatedly to each linear layer. Can someone explain their intuition?

icyfox
3 replies
1d20h

Like most things in ML, the answer of which layers to use comes down to empirical evidence more than theory. In a typical LoRA training pipeline, you freeze the contents of the base model and adjust only the LoRA layers. The more layers you convert to LoRA layers, the more degrees of freedom you have for the optimization (see the sketch below).

There are some finetuning regimens that recommend finetuning only the last layer, since it is theorized to hold the "highest-order" representation of the inputs. Other training regimens finetune all layers. It's largely data- and problem-dependent; LoRA just mirrors this convention.
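As a rough sketch of that pipeline (assuming a LinearWithLoRA wrapper like the one sketched further down the thread; the helper name here is hypothetical):

    import torch.nn as nn

    def add_lora_to_linears(model, rank=8, alpha=16):
        # freeze every parameter of the (sub)module first
        for param in model.parameters():
            param.requires_grad = False
        # then swap each nn.Linear for a LoRA-wrapped version; only the new
        # LoRA matrices created inside LinearWithLoRA remain trainable
        for name, child in model.named_children():
            if isinstance(child, nn.Linear):
                setattr(model, name, LinearWithLoRA(child, rank=rank, alpha=alpha))
            else:
                add_lora_to_linears(child, rank=rank, alpha=alpha)

Converting fewer layers is then just a matter of filtering which modules get wrapped.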

somethingsome
2 replies
1d18h

Yeah, but if I remember the paper correctly, LoRA followed the logic that only the last layers of an LLM change drastically during finetuning, while the earlier layers remain almost unchanged, so it made sense to alter only the last ones. Breaking this by adding a LoRA at each linear layer doesn't seem to follow the logic of why LoRA was created and why it works.

icyfox
1 replies
1d17h

Well, LoRA works just because it's a low-rank approximation of the full updates, much in the same way that SVD works and regular gradient updating works. It delivers good results both by acting as a regularizer and by allowing larger models to be updated with smaller memory footprints.

My point is that the original LoRA paper choosing the last layer is one choice. It is likely the most common one because that layer's higher symbolic nature is typically all that's needed for good performance on downstream tasks.

Depending on the size of your finetuning job, I've personally seen updating more layers (or updating some only on a certain learning-rate schedule) be more effective. LoRA is just the mathematical technique for the update; it doesn't really have a hypothesis on the ideal training regimen.
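To put a number on "low-rank approximation of the full updates": a full update to a d x k weight matrix trains d*k values, while LoRA trains only the factors of delta-W ~ B A. A hypothetical back-of-the-envelope example:

    # full-rank update vs. LoRA factors for one weight matrix
    d, k, r = 4096, 4096, 8           # hypothetical layer size and rank
    full_update = d * k               # 16,777,216 trainable values
    lora_update = r * (d + k)         # 65,536 values in B (d x r) and A (r x k)
    print(full_update / lora_update)  # 256.0, i.e. ~256x fewer trainable parameters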

somethingsome
0 replies
1d13h

Thanks, I'll meditate on that and re-read the paper with this view in mind.

The last sentence makes sense to me: if the finetuning job significantly changes the weights of layers other than just the last one, it is kinda normal to use LoRA on them. I had the impression that this was rarely the case, but I must be mistaken. I'll think about applications where this is the case.

andy99
3 replies
2d

"From scratch" seems to be a matter of opinion. "Pure pytorch" maybe, except it uses HF transformers. So it's LoRA on top of common frameworks...

rasbt
0 replies
2d

Yeah, the LoRA part is from scratch. The LLM backbone in this example is not; it's there to provide a concrete example. But you could apply the exact same LoRA-from-scratch code to a pure PyTorch model if you wanted to:

E.g.

    class MultilayerPerceptron(nn.Module):

        def __init__(self, num_features, num_hidden_1, num_hidden_2, num_classes):
            super().__init__()

            self.layers = nn.Sequential(
                nn.Linear(num_features, num_hidden_1),
                nn.ReLU(),
                nn.Linear(num_hidden_1, num_hidden_2),
                nn.ReLU(),
                nn.Linear(num_hidden_2, num_classes)
            )

        def forward(self, x):
            x = self.layers(x)
            return x

    model = MultilayerPerceptron(
        num_features=num_features,
        num_hidden_1=num_hidden_1,
        num_hidden_2=num_hidden_2, 
        num_classes=num_classes
    )

    # wrap each nn.Linear (indices 0, 2, 4 in the Sequential) with LoRA
    model.layers[0] = LinearWithLoRA(model.layers[0], rank=4, alpha=1)
    model.layers[2] = LinearWithLoRA(model.layers[2], rank=4, alpha=1)
    model.layers[4] = LinearWithLoRA(model.layers[4], rank=4, alpha=1)
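For completeness: the snippet above assumes a LinearWithLoRA wrapper. A minimal sketch along the lines of the article (the initialization details here are an assumption, not necessarily the article's exact code):

    import torch
    import torch.nn as nn

    class LoRALayer(nn.Module):
        def __init__(self, in_dim, out_dim, rank, alpha):
            super().__init__()
            # A starts random, B starts at zero, so A @ B = 0 and training
            # begins from the unmodified pretrained weights
            std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
            self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
            self.B = nn.Parameter(torch.zeros(rank, out_dim))
            self.alpha = alpha

        def forward(self, x):
            return self.alpha * (x @ self.A @ self.B)  # scaled low-rank update

    class LinearWithLoRA(nn.Module):
        def __init__(self, linear, rank, alpha):
            super().__init__()
            self.linear = linear  # the (frozen) pretrained layer
            self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

        def forward(self, x):
            return self.linear(x) + self.lora(x)  # original output plus correction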

michaelnny
0 replies
1d11h

If anyone is interested in a more 'pure' or 'scratch' implementation, check out https://github.com/michaelnny/QLoRA-LLM. (author here) It also supports 4-bit quantized LoRA, using only PyTorch and bitsandbytes, without any other tools.

2024throwaway
0 replies
2d

This apple pie recipe claims to be from scratch, but they cooked it in an off-the-shelf oven. So it's from scratch on top of the universe...

jamesblonde
2 replies
1d22h

I prefer the not-from-scratch, configuration-driven approach of Axolotl. Axolotl supports fine-tuning Mistral, Llama-2, etc., with lots of the latest techniques: sample packing, flash attention, xformers.

I concentrate on collecting and curating the fine-tuning data and do "data-centric" fine-tuning, not learning LoRA from scratch.

wfalcon
1 replies
1d19h

This is also what our (Lightning AI) lit-gpt library does: https://github.com/Lightning-AI/lit-gpt

jamesblonde
0 replies
1d12h

Thanks, hadn't seen this.

mintrain
1 replies
1d19h

I've added an exercise to practice implementing the LoRA forward pass from scratch: https://tensorgym.com/exercises/17

The idea behind LoRA is beautiful, and the implementation is pretty straightforward.

Rudeg
0 replies
1d18h

nice, looks very cool and useful! I'll definitely try it!

ijhuygft776
1 replies
1d23h

I wish the wireless LoRa protocol were open source...

ignoramous
1 replies
2d

I've been keeping track of the techniques through Maxime Labonne's LLMs 101: https://github.com/mlabonne/llm-course#4-supervised-fine-tun...

pama
0 replies
1d23h

Thanks for the resource. It seems useful enough to warrant its own thread here.

helloericsf
1 replies
1d19h

HN friends, what are the most popular libraries for fine-tuning? (Not from scratch.)

fnordfnordfnord
1 replies
1d15h

I thought this was going to be some neat software defined radio stuff. Still quite interesting though.

z3ugma
0 replies
1d15h

It's all about whether the 'A' is capitalized or not: LoRa = radio, LoRA = machine learning.

facu17y
1 replies
1d22h

What's the performance penalty of LoRA?

rasbt
0 replies
1d21h

During training, it's more efficient than full finetuning because you only update a fraction of the parameters via backprop. During inference, it can ...

1) ... be theoretically a tad slower if you add the LoRA values dynamically during the forward pass (however, this is also an advantage if you want to keep a separate small weight set per customer, for example; you run only one large base model and can apply the different LoRA weights per customer on the fly)

2) ... have the exact same performance as the base model if you merge the LoRA weights back with the base model.
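As a sketch of option 2, assuming the LoRALayer convention sketched earlier in the thread (update = alpha * (x @ A @ B)); merge_lora is a hypothetical helper, not the article's API:

    import torch

    @torch.no_grad()
    def merge_lora(layer):
        # nn.Linear stores weight as (out_features, in_features), so the
        # equivalent weight delta is the transpose of A @ B, scaled by alpha
        delta_w = layer.lora.alpha * (layer.lora.A @ layer.lora.B).T
        layer.linear.weight += delta_w
        return layer.linear  # plain nn.Linear; inference cost matches the base model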

yandrypozo
0 replies
1d21h

Gotta say, naming is hard. I thought this was about LoRa (from "long range") or LoRaWAN, the IoT sensor communication protocol.

tussa
0 replies
1d11h

It's cheap and sleazy to steal a name from another project to ride its fame.

huqedato
0 replies
2d

Excellent and practical example! I'm curious if there's a comparable one using Julia or JavaScript.

gourabmi
0 replies
1d23h

Someone somewhere is already working on naming their project Lehsun.. /s

broabprobe
0 replies
1d23h

wow definitely thought this was about LoRa at first.