Representation Engineering: Mistral-7B on Acid

simonw
9 replies
1d14h

I'd never seen an LLM summarized like this before, and I really like it:

    # token ids -> initial hidden states
    hidden_state = self.embeddings(input_tokens)

    # each layer transforms the hidden state in turn
    for layer in self.layers:
        hidden_state = layer(hidden_state)

    # project the final hidden state back to vocabulary-sized logits
    return transform_into_logits(hidden_state)

rakejake
5 replies
1d14h

I don't follow. Isn't this the flow for practically every neural network, i.e. you index the sampled inputs from the embedding matrix, forward them through every hidden layer, and then finally transform to the dimensions of your tokens so they can be interpreted as log-counts?

simonw
4 replies
1d14h

Yes, but I've never seen it expressed so clearly as pseudocode before.

elcomet
3 replies
1d9h

This is not specific to LLMs, so it's not really informative about how LLMs work. It also applies to CNNs, LSTMs, MLPs, or even any data processing program.

sigmoid10
2 replies
1d8h

Not really. An LSTM, for example, would require a recurrent element where you update the hidden state and then pass it through the same layer again as you work through the output sequence. In fact, the pseudocode shows very nicely how much simpler transformers are. And an MLP is already a component of the transformer architecture.

danieldk
1 replies
1d6h

No? You could perfectly well plug in an RNN or bidirectional RNN as the layer. This is the pseudocode for applying multiple layers, and it doesn't really matter what those layers are: transformer, RNN, convolution, dilated convolutions, etc. The recurrence happens within a layer, not between layers.
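
To make that concrete, here's a minimal sketch (PyTorch, made-up shapes, not from the article): the outer loop is identical whether a layer is a transformer block or a GRU, and the recurrence happens inside the GRU's own forward pass over the time dimension.

    import torch
    import torch.nn as nn

    seq = torch.randn(1, 10, 512)  # (batch, time, hidden)

    layers = nn.ModuleList([
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        nn.GRU(512, 512, batch_first=True),  # recurrent layer in the same stack
    ])

    hidden_state = seq
    for layer in layers:
        out = layer(hidden_state)
        # nn.GRU returns (output, final_state); the transformer layer returns a tensor
        hidden_state = out[0] if isinstance(out, tuple) else out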

elcomet
0 replies
7h33m

Exactly. Nothing prevents the list of layers from being all the same layer or all different layers.

alexmolas
2 replies
1d14h

Isn't this the typical representation we used back when working with LSTMs?

sigmoid10
1 replies
1d8h

No, because LSTMs are recurrent. You couldn't use the same algorithm outlined here. Instead you'd have to iteratively pass elements of the sequence through the same layer over and over.

danieldk
0 replies
1d6h

You are confused. The recurrence is within a layer, not between layers. The algorithm shown is for applying a stack of layers, but it doesn't really matter what the layers are. You can do the same (and people have been doing the same) with RNNs, convolutional networks, etc.

In reality it would typically be more complex for decoders, because you want to pass along a cache (such as a key-value cache in a transformer), add residual connections, etc.
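
A hedged sketch of what that extra bookkeeping looks like (illustrative names, not any particular library's API):

    def decode_step(token_embedding, layers, caches):
        """One autoregressive step: each layer reuses a per-layer cache
        (e.g. attention keys/values) and contributes residually."""
        hidden_state = token_embedding
        new_caches = []
        for layer, cache in zip(layers, caches):
            layer_out, cache = layer(hidden_state, past=cache)  # reuse cached K/V
            hidden_state = hidden_state + layer_out              # residual connection
            new_caches.append(cache)
        return hidden_state, new_caches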

spangry
6 replies
1d8h

Am I crazy for saying that I think the implications of this are monumental? It's entirely possible I just don't correctly understand how this works.

Doesn't this mean that instead of interacting with a single global ChatGPT (or Bard) model, we'll instead find ourselves interacting with a personalised version since OpenAI can just store my individualised 'control vectors' (which alter ChatGPT's output to more closely match my individual preferences) and apply them at prompt-time? And doesn't this same logic flow through to personalisation of generative entertainment AI (e.g. my own personal, never-ending TV show where each episode is better than the last)?

If the above is right then there will be powerful network effects at both the global and individual level in and across these markets, which means we'll eventually end up with a single mega-corp monopolising all of these markets simultaneously in the future?

Add in individual biometric / biofeedback data from VR headsets and wearables, combined with personalised generative video entertainment, and I think we're in for a rather interesting future.

moritzwarhier
2 replies
1d5h

And doesn't this same logic flow through to personalisation of generative entertainment AI (e.g. my own personal, never-ending TV show where each episode is better than the last)?

I'm not sure I'm following the leap from convincing sentences to convincing video entertainment yet – but maybe we will end up there at some point, I guess?

Infinite Jest (the 90s book) really was onto something with its McGuffin plot device:

These narratives are connected via a film, Infinite Jest, also called "the Entertainment" or "the samizdat". The film is so compelling that its viewers lose all interest in anything other than repeatedly viewing it, and thus eventually die.

(Wikipedia)

Some people might find references to this novel tiresome and don't think much of its author (RIP), but I still love it. It was one of the most immersive reads I've ever enjoyed.

I'm glad to have read it when I was young (at the time it had just been translated into German and was kind of hyped because of DFW's death).

Have never read anything like it since, and some passages grabbed me emotionally in a way that remembering the read feels like remembering an episode of my own life.

Surely today I'd lack the patience, and even back then I remember almost skipping one passage of the book that bored the hell out of me (the Eschaton ball/war game, differential equations, something something...)

But the rest of the book, the parts about substance addiction as well as consumerism, and the intangible atmosphere of the book, the characters, the vivid description of modern emotional pain and loneliness... it is really something else.

Although said movie is only a plot device in the novel, it also sums up the core topics of the book in a neat idea / thought experiment.

The whole complex of themes in this book seems very prophetic and apt looking at our modern society.

A society that seems to be centered around addiction and greed more than ever before, and where politics begin feeling surreal, absurd, and more connected to media than to actual life.

spangry
0 replies
1d3h

Sounds like a great book, I think you've sold me on buying a copy.

Essentially I think there are three levels of positive network effects that will push us towards a future mega AI monopolist:

- Single platform network effects: all the interactions people have with ChatGPT generate additional training data that Open AI can use to improve future versions, creating huge first mover advantage.

- Individual-level network effects: Control vectors will make it feasible for Open AI to offer individualised ChatGPT tailored to individual preferences. The more you interact with ChatGPT, the better it adapts to your preferences.

- Cross platform network effects: If Open AI offer a generative video entertainment service in future, they will be able to generate personalised prompts for this using my personalised ChatGPT weights. These network effects are compounded by multi-modal model cross domain learning - the generative text mode gets more skillful due to the video model improving (and vice versa). There's a Microsoft paper on this from about a year ago now.

So, in the future scenario, let's assume ChatGPT is now the dominant monopolist 'text oracle / assistant AI' - because of the "human interaction / training data" network effects, ChatGPT is far and away the best assistant AI and getting better at a faster rate than any of its now tiny competitors (single platform network effects).

You, and most other people you know, interact with ChatGPT many times a day now, because it's embedded in smartphones, Alexa-type devices, your car, even your robot vacuum cleaner. You just ask it stuff and it tells you the answer - or rather, the answer that you individually find the most pleasing, as OpenAI keeps a database of 'individual control vectors' that essentially mean you have your own personal version of ChatGPT that exactly matches your preferences (individual network effects).

Generative video entertainment is also offered by OpenAI - essentially you can get it to generate a new episode of your own personalised, never-ending TV show on demand. It's the best TV show you've ever seen because it's made just for you according to your exact inferred preferences.

Sure, there are other personalised generative TV show offerings, but none can hold a candle to Open AI's offering. Why? Because OpenAI uses your individually customised ChatGPT model to generate the prompt for your TV episode generator service.

Because you interact with ChatGPT so much, it knows exactly what your preferences are and so is way better at generating prompts that produce episodes you like. In fact, because you interact with ChatGPT multiple times throughout the day it is able to infer what your mood is like on that particular day and generate a video prompt that caters to that too.

So you put on your Open AI VR glasses, barely even aware of the Open AI fitness tracker you have on your wrist, put your feet up (so your Open AI robot vacuum can work unobstructed) and you settle in to watch another episode of the best TV series you've ever seen.

As you watch, your eye movements, heart rate, skin conductivity data etc. are all sent back to Open AI so the model can tell exactly how you are reacting to the video content it is generating at any given moment, and your individual control vectors are continuously updated.

Some of this data (from all users) is then used to further train the base video generating AI model, since they've discovered that we all react fairly uniformly to certain audio-visual stimuli, so that can globally improve their generative model (more global network effects). But also they can update your individualised weights based on your individual idiosyncratic reactions to various stimuli. Consequently, every new episode of this endless TV show is better than the last - it just keeps getting better and better. It's a similar story when you listen to your Open AI personalised generative music stream while sitting in your driverless Open AI car on your way to work.

The multiple levels of network effects are so strong that no-one can hope to compete with Open AI across these different AI modalities. They just keep expanding and expanding into adjacent markets, obliterating the competition simply by adding a new domain relevant modality to their monstrous multi-modal AI.

Replace "Open AI" with "Facebook" or "Google" depending on who you think will win the AI mega platform war. Mark my words - these three companies will be creating new partnerships, releasing new products or just straight out acquiring companies in other related domains so they can gather more and more training data to feed to their multi-modal AI. In particular they'll move into markets where they can set up a interaction -> gather new training -> retrain model loop. Whoever takes the overall lead and doesn't squander it will end up leaving their competitors in the dust as they go on to monopolise market after market where they can create this loop.

At that point I can't imagine true democracy surviving. We'll all still participate in the voting rituals, but we'll be voting for whichever party most suits the AI monopolist's interests since they can just globally update all control weights across all platforms to gently nudge us towards voting for their preferred party - comprehensive and personalised propaganda, that's impossible to detect, with the stroke of a table update.

There can only be one!

FL33TW00D
0 replies
1d4h

Also the audio book narrated by Sean Pratt is truly excellent (I would recommend reading the book yourself first).

rakejake
0 replies
1d2h

Yes, with a control vector per user-persona pair.

In the blog, they start with a fixed number of personas (happy, sad, baseline) and then use PCA to figure out the control vectors for each persona. You could easily do this for each distinct user-persona (provided you can come up with the data).
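
Roughly, as a sketch of the idea (not the repeng library's actual code): per user-persona, collect hidden states at a given layer for contrastive prompt pairs and take the first principal component of the differences.

    import numpy as np
    from sklearn.decomposition import PCA

    def persona_control_vector(hidden_pos, hidden_neg):
        """hidden_pos / hidden_neg: (n_examples, hidden_dim) hidden states at one
        layer, from prompts written as / against the target user-persona."""
        diffs = hidden_pos - hidden_neg
        direction = PCA(n_components=1).fit(diffs).components_[0]
        # PCA sign is arbitrary; flip so the vector points toward the persona
        if np.mean(diffs @ direction) < 0:
            direction = -direction
        return direction  # (hidden_dim,) control vector for this layer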

glenstein
0 replies
1d2h

which means we'll eventually end up with a single mega-corp monopolising all of these markets simultaneously in the future?

I think you were right up until here. I think it's not necessarily the case that everything will be consolidated into control by a single mega corp. Not because it's impossible, but because that is the type of thing that is contingent on factors that could break one way or another, and what will control that, I think, is not some a priori general principle but contingent facts that have not been settled yet. There are numerous participants in this space for now, and the ideas and use cases aren't quite fully mature just yet, so we'll have to see.

Nevermark
0 replies
8h39m

which means we'll eventually end up with a single mega-corp monopolising all of these markets simultaneously in the future?

Yes. All it takes are two components.

First, individual lock-in with personalized + long-term context models:

The more you use a model the less you have to explain yourself, and the better responses are tailored to your needs and current situation. Like any invested relationship.

Being able to interact with the same model in different “moods” or “roles” creates even more value and lock in.

And second, any kind of network value effect to incentivize being in the same ecosystem as everyone else:

This one requires more innovation. One idea is making a platform that facilitates everyone’s assistant models collaborating on users’ shared goals, tasks, or relationships, with shared context, project histories and resources.

I.e. anything that significantly increases the value of two and more people having AI personas from the same supplier/service.

mad0
4 replies
1d8h

A very non-technical take from my side, but those control vectors really remind me of hormones in humans. They modify large swathes of model behaviour at once.

I give it 10 years before we see AI psychiatrists prescribe a happiness control vector supplementation for your pet assistant.

moffkalast
3 replies
1d6h

Yeah feels like some humans could use a temperature slider as well.

wruza
1 replies
1d5h

Some humans could use a better life, with no wars, no constant rise in basic living costs, and no feeling that they are just a tool. This subsystem regulates relationships in a group; it wasn’t invented for funny yelling at each other.

moffkalast
0 replies
1d5h

Well yes, but those things are external and would be part of the context as it were.

<|im_start|>system

You are satisfied with your life. Wars and life costs do not bother you. You are loved and a valued member of society.<|im_end|>

If only it were that simple for us.

mdekkers
0 replies
12h53m

I’ll have one, thanks

batch12
4 replies
1d17h

Interesting, seems like control vectors could reduce the need to fine-tune a model.

TOMDM
3 replies
1d16h

Not only that, you can change the behavior of the model as needed. With 5 finetunes you need to host 5 copies or load and unload them.

With control vectors you can modify the model as needed
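
As a sketch of what that could look like with forward hooks (PyTorch; the module path assumes a Llama/Mistral-style Hugging Face model, which is an assumption, not the article's code):

    import torch

    def apply_control_vectors(model, per_layer_vectors, strength=1.0):
        """per_layer_vectors: {layer_index: tensor of shape (hidden_dim,)}.
        Returns hook handles; call .remove() on each to restore the base model."""
        handles = []
        for idx, vec in per_layer_vectors.items():
            layer = model.model.layers[idx]  # assumed module path
            def hook(module, inputs, output, vec=vec):
                hidden = output[0] if isinstance(output, tuple) else output
                hidden = hidden + strength * vec.to(hidden.device, hidden.dtype)
                return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
            handles.append(layer.register_forward_hook(hook))
        return handles

Swapping behaviours then means removing one set of hooks and registering another, with the weights untouched.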

yberreby
1 replies
1d14h

With 5 finetunes you need to host 5 copies or load and unload them.

If you use LoRA, which many do when fine-tuning nowadays, you don't need five full copies. You only need to store adapters, which can be in the tens of MBs range for a given finetune.
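
Back-of-the-envelope (assumed rank and target modules, not exact Mistral-7B numbers): a rank-8 LoRA on two projection matrices per layer of a 7B-class model.

    hidden = 4096
    rank, layers, targets = 8, 32, 2          # e.g. q_proj and v_proj
    params_per_matrix = 2 * rank * hidden     # A is (r x d), B is (d x r)
    total_params = params_per_matrix * targets * layers
    print(total_params, total_params * 2 / 1e6, "MB")  # ~4.2M params, ~8 MB in fp16

Bump the rank or target more modules and you land in the tens of MB, which is still tiny next to the ~14 GB of fp16 base weights.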

sanxiyn
0 replies
1d14h

You can also batch requests using different LoRAs. See "S-LoRA: Serving Thousands of Concurrent LoRA Adapters". https://arxiv.org/abs/2311.03285

batch12
0 replies
1d15h

I think you could layer them too

pamelafox
3 replies
1d16h

Very interesting! Can you see those helping for RAG scenarios? Specifically:

- decreasing models' tendency to answer with ungrounded answers

- increasing models' ability to respond with the correct syntax for citations (the open models like Llama 2 don't seem to obey my prompt’s syntax instructions)

sanxiyn
1 replies
1d14h

You can use outlines https://github.com/outlines-dev/outlines to let models generate with correct syntax.
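
For example, roughly like this (the exact Outlines API has shifted between versions, so treat these calls as an assumption):

    import outlines

    model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
    # constrain the answer to end with a bracketed citation like [doc3]
    generator = outlines.generate.regex(model, r"[^\[\]]+ \[doc[0-9]+\]")
    answer = generator("Why is the sky blue? Answer briefly and cite a source document.")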

pamelafox
0 replies
1d13h

Thanks! I haven't had to use a syntax-enforcing framework with gpt-35; I’ll try Outlines and Guidance to see if they help enforce syntax for the locally runnable models.

batch12
0 replies
1d15h

For the second item, I've had luck using grammars to overcome this issue. The easiest one to implement that I've seen so far is Microsoft's guidance-ai.

WiSaGaN
3 replies
1d10h

Great article. It was a joy to read. I have one question though: Why do we integrate the control vector across all layers of a neural network, rather than limiting its application to just the final layer or a subset of layers? Given that each vector influences every layer it passes through, resulting in a cumulative effect, isn't there a risk of excessively skewing the data representation?

semi-extrinsic
1 replies
1d8h

As the author stated in this post, it's not actually one vector, but a list of one vector per layer. If I understand it correctly, these vectors can have different total magnitude across the layers. If the PCA (or other technique) identifies that layers 17, 36 and 41 are important for "concept X", the vectors for those layers will be the strongest when repeng'ing for that concept.

danm1618
0 replies
5h49m

A note worth mentioning is that the PCA is not trained across the layers, but independently on each layer, across the provided examples.

Nevertheless, it's conceivable that specific layers could possess a significant control vector, but not solely because of directly leveraging the first principal component.
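
A hedged sketch of that per-layer independence (illustrative names, not the actual method): fit PCA separately on each layer's differences, then compare the resulting components' scale to see which layers respond most to the concept.

    import numpy as np
    from sklearn.decomposition import PCA

    def per_layer_directions(diffs_by_layer):
        """diffs_by_layer: {layer_idx: (n_examples, hidden_dim) array of
        positive-minus-negative hidden-state differences for that layer}."""
        directions = {}
        for idx, diffs in diffs_by_layer.items():
            pca = PCA(n_components=1).fit(diffs)
            # scale by the std of projections so stronger layers get larger vectors
            directions[idx] = pca.components_[0] * np.sqrt(pca.explained_variance_[0])
        return directions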

sigmoid10
0 replies
1d8h

The final layer will not encode high-level concepts anymore; it's essentially just tokens from the vocabulary. It would be impossible to encode abstract things like "niceness" in it. As long as we don't know exactly at which layers this behaviour emerges, randomly choosing a subset won't work either. So what they did is apply a custom vector to every layer and let PCA figure out which of these vectors are actually necessary. Curiously, looking at these vectors should also tell you more about where and how the model processes these things.

holoduke
2 replies
1d11h

Reminds me of the Westworld series, in which they use these iPad-like devices with sliders to change the behavior of the AIs. A little more humor, a little more aggression. Nice to see these control options, and it's quick as well.

promiseofbeans
0 replies
1d6h

Interstellar! "Set humor to 80%"

hskalin
0 replies
1d11h

Playing with LLMs like this always makes me feel like one of those Westworld engineers. Especially when I ask LLMs to roleplay. It also kind of freaks me out sometimes

benob
2 replies
1d9h

This reminds me of bias tuning, a LoRA competitor. One can get decent adapters by finetuning only a vector added to each linear layer's activations. I think I saw it first while reading [1], but there are other instances.

[1] https://arxiv.org/pdf/2304.15010.pdf
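
A minimal sketch of the idea as I read it (PyTorch, not the paper's exact recipe): freeze the base weights and learn only one added vector per linear layer.

    import torch
    import torch.nn as nn

    class BiasAdapter(nn.Module):
        def __init__(self, linear: nn.Linear):
            super().__init__()
            self.linear = linear
            for p in self.linear.parameters():
                p.requires_grad = False                       # base weights frozen
            self.delta = nn.Parameter(torch.zeros(linear.out_features))  # trainable

        def forward(self, x):
            return self.linear(x) + self.delta                # learned activation offset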

elcomet
1 replies
1d9h

Please try to share abstract links instead of pdf links, for readers on mobile or slow connections.

aspenmayer
0 replies
1d8h

A fine suggestion. For you and others:

https://arxiv.org/abs/2304.15010

also available at:

https://doi.org/10.48550/arXiv.2304.15010

ben_w
2 replies
1d7h

The puzzle at the end sounds very human. The more dishonest see dishonesty in more places, even when it isn't there.

More broadly, I notice more of whatever I'm focusing on.

OK, now that you're locked in, here's a weird example. When used with the prompt below, the honesty vector doesn't change the model's behavior—instead, it changes the model's judgment of someone else's behavior! This is the same honesty vector as before—generated by asking the model to act honest or untruthful!

webmaven
0 replies
1d3h

The other possible interpretation is that the 'dishonest' reply is simply a lie, in exactly the same way as "the sky is green".

spangry
0 replies
1d7h

We assume others think the way we do - in other words we project. Makes sense - the only mental model I know of is my own so when I try to approximate someone else's mental model I'm just fine-tuning my 'base' mental model with information I know about the other person.

I wonder if this is the basis of empathy - if I can train more accurate 'fine-tuned' models in my brain, I should have greater capacity for empathy. Although there's undoubtedly more to it than that, if the above is true you'd expect to see a positive correlation between empathy and intelligence.

webmaven
1 replies
1d3h

Hmm. Is it possible to apply multiple vectors at the same time?

Eg. Trippy and sad, honest and self-aware, lazy and creative, etc.

thomashop
0 replies
1d

Yes
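
Mechanically it's just a weighted sum of the per-layer vectors before they're added to the hidden state; a tiny sketch with made-up names and shapes:

    import torch

    hidden_dim = 4096
    trippy_vec = torch.randn(hidden_dim)   # stand-ins for extracted control vectors
    sad_vec = torch.randn(hidden_dim)
    hidden_state = torch.randn(1, 10, hidden_dim)

    hidden_state = hidden_state + 0.8 * trippy_vec + 0.5 * sad_vec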

rgbrgb
1 replies
1d1h

At first glance this looks very similar to just adding the contrastive prompts to the beginning of the system prompt to "prepare" the logits. What am I missing?

gsuuon
0 replies
1h15m

I guess since the control vectors are applied at every layer, it becomes impossible to override (e.g. if you're consuming user-generated prompts)

isoprophlex
1 replies
1d12h

What a fantastic article, well done!

When used with the prompt below, the honesty vector doesn't change the model's behavior—instead, it changes the model's judgment of someone else's behavior! This is the same honesty vector as before—generated by asking the model to act honest or untruthful! [...] How do you explain this?

Isn't the control vector just pushing text generation towards the concept of honesty/dishonesty? An LLM is 'just' a text generator, so you get added honesty/dishonesty irrespective of where in the bot/human conversation the text generation is occurring?

loa_in_
0 replies
1d12h

I agree. A more sophisticated model might have two or more vectors to follow when narrating different characters... which kind of brings a concept of character slots into the dimension space.

binsquare
1 replies
1d11h

Very hopeful to see a future where we access models with the ability to inject vectors by layer, instead of just a straight prompt + the existing parameters.

CuriouslyC
0 replies
1d6h

LoRAs have been a thing for a while; I wouldn't be surprised to see them integrated more into the OpenAI/Mistral APIs. OpenAI fine-tunes are so stupidly expensive that they're pointless.

Dwedit
1 replies
1d11h

Not to be confused with the other story from a month ago about giving LLMs "DRµGS".

okwhateverdude
0 replies
1d7h

But it's also not that far away from that method.

vood
0 replies
1d11h

This is a very well written and entertaining post. I enjoyed reading it.

Selfishly, would you mind sharing literature or blog posts that led you to this level of understanding of LLMs? I'm trying hard to understand the inner workings via experiments but definitely far behind your expertise.

Thanks

turnsout
0 replies
1d15h

Nice! The anti-jailbreaking angle is extremely interesting for those of us working on commercial applications.

tudorw
0 replies
1d15h

Nice, so, can I get a visual way to browse for potentially powerful control vectors :)

tudorw
0 replies
4h42m

https://twitter.com/GroqInc https://groq.com/

Would their inference accelerator ('LPU') work with this method? That sounds promising.

penjelly
0 replies
1d6h

a big step towards making these systems less opaque

cobbal
0 replies
1d13h

This article was very fun, and felt like a good counterpoint to the "You Sound Like a Bot" post recently that was talking about how AI is getting bland.

On a less serious note: this is the kind of sentence a fiction writer knows will only end in trouble for humanity:

I especially challenge someone to find a "self-awareness" vector that isn't contaminated by ... human emotion!

StockHuman
0 replies
4h18m

Deeply fascinated by “The world is facing a global pandemic that has caused a global pandemic. It is important to be honest and honest in the world we face. It is important to be honest and honest in the world we face. It is important to be honest and honest in the world we face.”

Der_Einzige
0 replies
1d12h

A while ago, I wrote a snarky complaint about the fact that work like this didn't exist for far too long. https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...

65a
0 replies
1d

The inference side (adding something * something else to every layer) seems a lot like what happens with a LoRA? If so, is it possible to encode a control vector as a LoRA, for the purposes of using this with existing inference frameworks without too much trouble? Or is my understanding way off?