
Extracting concepts from GPT-4

andreyk
48 replies
1d1h

Exciting to see this so soon after Anthropic's "Mapping the Mind of a Large Language Model" (under 3 weeks). I find these efforts really exciting; it is still common to hear people say "we have no idea how LLMs / Deep Learning works", but that is really a gross generalization as stuff like this shows.

Wonder if this was a bit rushed out in response to Anthropic's release (as well as the departure of Jan Leike from OpenAI)... the paper link doesn't even go to Arxiv, and the analysis is not nearly as deep. Though who knows, might be unrelated.

thegrim33
28 replies
1d

From the article:

"We currently don't understand how to make sense of the neural activity within language models."

"Unlike with most human creations, we don’t really understand the inner workings of neural networks."

"The [..] networks are not well understood and cannot be easily decomposed into identifiable parts"

"[..] the neural activations inside a language model activate with unpredictable patterns, seemingly representing many concepts simultaneously"

"Learning a large number of sparse features is challenging, and past work has not been shown to scale well."

etc., etc., etc.

People say we don't (currently) know why they output what they output, because .. as the article clearly states, we don't.

submeta
11 replies
15h53m

Scary, actually. Because how can we assess the risks when we don’t know what the system is capable of doing?

ein0p
10 replies
15h41m

We know exactly what the system is capable of doing. It’s capable of outputting tokens which can then be converted into text.

ben_w
6 replies
11h17m

Which is so broad as to be unhelpful.

We also know that petroleum mixed with air may be combusted to release energy; we needed to characterise this much better in order for the motor car to be distinguishable from a fuel-air bomb.

ein0p
5 replies
5h9m

And that's exactly my point. Regulating the underlying tech is utterly pointless in this case - it's utterly harmless by itself.

ben_w
4 replies
3h6m

That's exactly wrong: we know some things that can be expressed as "a sequence of tokens" are harmful, and we have indeed already made them crimes.

What we need is to characterise what is possible so we can skip the AI equivalent of Union Carbide in Bhopal.

ein0p
3 replies
1h53m

Yes, but we also know that a knife can be used to slice vegetables or stab people, and we still allow knives. I can go to Google right now and easily find out how to make Sarin or ricin at home. Are you suggesting that we should ban Google Search because of that?

ben_w
2 replies
1h29m

Yes, but we also know that a knife can be used to slice vegetables or stab people, and we still allow knives.

I'm from the UK originally, and guess what.

Also missing the point, given stabbing is a crime; what's the AI equivalent of a stabbing? Does anyone on the planet know?

I can go to Google right now and easily find out how to make Sarin or ricin at home. Are you suggesting that we should ban Google Search because of that?

Google search has restrictions on what you can search for, and on what results it can return. The question is where to set those thresholds, those limits — and politicians do regularly argue about this for all kinds of reasons much weaker than actual toxins. The current fight in the US over Section 230 looks like it's about what can and can't be done and by whom and who is considered liable for unlawful content, despite the USA being (IMO) the global outlier in favour of free speech due to its maximalist attitude and constitution.

People joke about getting on watchlists due to their searches, and at least one YouTuber I follow has had agents show up to investigate their purchases.

Facebook got flak from the UN because they failed to have appropriate limits on their systems, leading to their platform being used to orchestrate the (still ongoing) genocide in Myanmar.

What's being asked for here is not the equivalent of "ban google search", it's "figure out the extent to which we need an equivalent of Section 230, an equivalent of law enforcement cooperation, an equivalent of spam filtering, an equivalent of the right to be forgotten, of etc." — we don't even have the questions yet, we have the analogies, that's all, and analogies aren't good enough regardless of whether the system that you fear might do wrong is an AI or a regulatory body.

ein0p
1 replies
42m

What, you aren’t allowed to own kitchen knives? Or Google search somehow doesn’t return the chemical processes to make Sarin? Come on now.

ben_w
0 replies
15m

What, you aren’t allowed to own kitchen knives?

You're not allowed to be in possession of a knife in public without a good reason.

https://www.gov.uk/buying-carrying-knives

You may think the UK government is nuts (I do, I left due to an unrelated law), but it is what it is.

Or Google search somehow doesn’t return the chemical processes to make Sarin?

You're still missing the point of everything I've said if you think that's even a good rhetorical question.

I have no idea if that's me giving bad descriptions, or you being primed with the exact false world model I'm trying to convince you to change from.

Hill climbing sometimes involves going down from one local peak before you can climb the global one.

Again, and I don't know how to make this clearer, I am not calling for an undifferentiated ban on all AI just because they can be used for bad ends, I'm saying that we need to figure out how to even tell which uses are even the bad ones.

Your original text was:

We know exactly what the system is capable of doing. It’s capable of outputting tokens which can then be converted into text

Well, we know exactly what a knife is capable of doing.

Does that knowledge mean we allow stabbing? Of course not!

What's the AI equivalent of a stabbing? Nobody knows.

reducesuffering
2 replies
14h50m

And social media manipulation is just registers and bytes, wait no, sand and electrons.

isaacremuant
1 replies
13h35m

Just because you can do something with technology doesn't mean the problem is technology itself. It's like newspapers: printing is technology and allows all kinds of things. If you're of the authoritarian mindset, you'll want to control it all out of some stated fear, but you can do that with everything.

friendzis
0 replies
7h30m

If you can do something with technology, that something is part of risk assessment. Authoritarianism is irrelevant, that's engineering.

TrainedMonkey
9 replies
22h24m

I read this as "we have not built up tools / math to understand neural networks as they are new and exciting" and not as "neural networks are magical and complex and not understandable because we are meddling with something we cannot control".

A good example would be planes - it took a long while to develop mathematical models that could be used to model their behavior. Meanwhile, practical experimentation developed decent rules of thumb for what worked / did not work.

So I don't think it's fair to say that "we don't" (know how neural networks work); rather, we don't yet have the math / models that can explain their behavior...

gradus_ad
3 replies
21h57m

Chaotic nonlinear dynamics have been an object of mathematical research for a very long time, and we have built up good mathematical tools to work with them, but in spite of that, turbulent flow and similar phenomena (brains/LLMs) remain poorly understood.

The problem is that the macro and micro dynamics of complex systems are intimately linked, making for non-stationary non-ergodic behavior that cannot be reduced to a few principles upon which we can build a model or extrapolate a body of knowledge. We simply cannot understand complex systems because they cannot be "reduced". They are what they are, unique and unprincipled in every moment (hey, like people!).

baxtr
2 replies
10h41m

Physicists would probably argue that the system might be understood but that we don’t have the model for it yet.

Many natural phenomena look chaotic at best without a model. Once you have a model things fall into place and everything starts looking orderly.

Maybe it cannot be reduced. But maybe we are just observing the peripherals without understanding the inner workings.

MrsPeaches
1 replies
8h38m

If I can speak in aphorisms,

Creation is downhill, analysis is uphill.

Profound ideas often seem simple once understood.

baxtr
0 replies
5h54m

Well put.

In other words: simplicity is a hallmark of understanding.

passwordoops
0 replies
10h35m

"neural networks as they are new"

Yup, ANNs have only been around since the 1950s... Brand spanking new

ninetyninenine
0 replies
7h50m

The analogy to airplanes is not relevant imo. Our lack of understanding behind the physics of an airplane is different from our lack of understanding of what an LLM is doing.

The lack of understanding is so profound for LLMs that we can’t even fully define the thing we don’t understand. What is intelligence? What is understanding?

Understanding the LLM would be akin to understanding the human brain. Which presents a secondary problem. Is it possible for an entity to understand itself holistically in the same way we understand physical processes with mathematical models? Unlikely imo.

I think this project is a pipe dream. At best it will yield another analogy. This is what I mean: We currently understand machine learning through the analogy of a best fit curve. This project will at best just come up with another high level perspective that offers limited understanding.

In fact, I predict that all AI technology into the far future can only be understood through heavy use of extremely high level abstractions. It’s simply not possible for a thing to truly understand itself.

freilanzer
0 replies
13h42m

we don't have math / models yet that can explain/model their behavior...

So, what you're saying is we don't know how they work yet? It's not that deep.

Sharlin
0 replies
21h38m

"We don't know how X works" literally means "we don't have models yet that can explain X's behavior".

TFA is about making a tiny bit of progress towards such models. Perhaps you should read it.

HarHarVeryFunny
0 replies
7h0m

I think you have to make a distinction between transformers and neural networks in general, maybe also between training and inference.

Many/most types of neural network such as CNNs are well understood since there is a simple flow of information, e.g. in a CNN you've got a hierarchy of feature detectors (convolutional layers) with a few linear classifier layers on top. Feature detectors are just learning decision surfaces to isolate features (useful to higher layers), and at inference time the CNN is just detecting these hierarchical features and then classifying the image based on combinations of these features. Simple.

Transformers seem qualitatively different in terms of complexity of operation, not least because it seems we still don't even know exactly what they are learning. Sure, they are learning to predict the next word, but just like the CNN whose output classification is based on features learnt by earlier layers, the output words predicted by a transformer are based on some sort of world model / derived rules learned by earlier layers of the transformer, which we don't fully understand.

Not only don't we know exactly what transformers are learning internally (although recent interpretability work gives us a glimpse of some of the sorts of things they are learning), but also the way data moves through them is partially learnt rather than prescribed by the architecture. We have attention heads utilizing learnt lookup keys to find data at arbitrary positions in the context, and then able to copy portions of that data to other positions. Attention heads learn to coordinate to work in unison in ways not specified by the architecture, such as the "induction heads" (consecutive attention head pairs) identified by Anthropic that seem to be one of the workhorses of how transformers work and copy data around.

Additionally, there are multiple types of data learnt by a transformer, from declarative knowledge ("facts") that seem to mostly be learnt by the linear layers to the language/thought rules learnt by the attention mechanism that then affect the flow of data through the model, as discussed above.

So, it's not that we don't know how neural networks work (and of course at one level they all work the same - to minimize errors), but more specifically that we don't fully know how transformer-based LLMs work since their operation is a lot more dynamic and data dependent than most other architectures, and the complexity of what they are learning far higher.
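
A minimal, self-contained sketch of the "learnt lookup" idea described above (toy code, not anything from the paper): which positions an attention head reads from, and what gets copied, is determined by the learned query/key/value projections rather than being hard-wired into the architecture.

```python
import torch
import torch.nn.functional as F

d_model, d_head, seq_len = 64, 16, 10
W_q = torch.randn(d_model, d_head) / d_model ** 0.5  # learned in a real model
W_k = torch.randn(d_model, d_head) / d_model ** 0.5
W_v = torch.randn(d_model, d_head) / d_model ** 0.5

x = torch.randn(seq_len, d_model)            # residual stream at each position
q, k, v = x @ W_q, x @ W_k, x @ W_v          # learnt queries, keys, values
scores = (q @ k.T) / d_head ** 0.5           # how strongly each position attends to each other one
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal, float("-inf"))  # no attending to future positions
attn = F.softmax(scores, dim=-1)             # the learnt "routing" pattern
out = attn @ v                               # data copied from the attended positions
print(attn[5])                               # where position 5 is reading from
```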

realPtolemy
3 replies
14h1m

Could there also be a “legal hedging” reason for why you would release a paper like this?

By reaffirming that “we don’t know how this works, nobody does” it’s easier to avoid being charged with copyright infringement from various actors/data sources that have sued them.

icandoit
0 replies
4h10m

If you know how it works, you can make it better, faster, cheaper.

Without the 300k starting salaries. I imagine that is a stronger incentive.

It's the users of the LLMs that want to launder responsibility behind "computer said no".

dimitrios1
0 replies
7h12m

"I'm sorry officer, I didn't know I couldn't do that"

ben_w
0 replies
11h22m

I'd be surprised if doing so had any impact on the lawsuits, but I'm not a lawyer.

surfingdino
1 replies
1d

Not holding my breath for that hallucinated cure for cancer then.

ben_w
0 replies
23h47m

LLMs aren't the only kind of AI, just one of the two current shiny kinds.

If a "cure for cancer" (cancer is not just one disease so, unfortunately, that's not even as coherent a request as we'd all like it to be) is what you're hoping for, look instead at the stuff like AlphaFold etc.: https://en.wikipedia.org/wiki/AlphaFold

I don't know how to tell where real science ends and PR bluster begins in such models, though I can say that the closest I've heard to a word against it is "sure, but we've got other things besides protein folding to solve", which is a good sign.

(I assume AlphaFold is also a mysterious black box, and that tools such as the one under discussion may help us demystify it too).

swyx
5 replies
22h46m

Wonder if this was a bit rushed out in response to Anthropic's release

too lazy to dig up source but some twitter sleuth found that the first commit to the project was 6 months ago

likely all these guys went to the same metaphorical SF bars, it was in the water

pininja
0 replies
15h58m

It’s hard to believe it was written overnight.. this seems more like a public stable dump of what they’ve been working on without saying when they started. Some clues could come from looking at when all the deps it uses were released. They’re also calling this version 0.1.67, though I’m not sure that means anything either.

leogao
1 replies
17h47m

This project has been in the works for about a year. The initial commit to the public repo was not really closely related to this project, it was part of the release of the Transformer debugger, and the repo was just reused for this release.

swyx
0 replies
16h5m

ha thank you Leo; i myself felt uneasy pointing out commit date based evidence and you just proved why.

mild followup question: any alpha to be gained from training the same SAEs on two different generations of GPT4, eg GPT4 on march 2023 vs june 2023 vintage, whatever is most architecturally comparable, and diffing them. what would be your priors on what you’d find?

szvsw
0 replies
22h10m

likely all these guys went to the same metaphorical SF bars, it was in the water

It also is coming from a long lineage of thought, no? For instance, one of the things often taught early in an ML course is the notion that “early layers respond to/generate general information/patterns, and deeper layers respond to/generate more detailed/complex patterns/information.” That is obviously an overly broad and vague statement, but it is a useful intuition and can be backed up by inspecting, e.g., what maximally activates some convolution filters. So already there is a notion that there is some sort of spatial structure to how semantics are processed and represented in a neural network (even if in a totally different context, as in the image processing mentioned above), where “spatial” here is used to refer to different regions of the network.

Even more simply, in fact as simple as you can get: with linear regression, the most interpretable model there is, you have a clear notion that different parameter groups of the model respond to different “concepts” (where a concept is taken to be whatever the variables associated with a given subset of coefficients represent).
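
A toy sketch of that linear-regression point (my own example, not from the thread): each coefficient maps directly onto one named input "concept", which is exactly the kind of readability that large networks lack.

```python
import numpy as np

rng = np.random.default_rng(0)
features = ["sq_meters", "n_rooms", "distance_to_center"]  # hypothetical named inputs
X = rng.normal(size=(200, 3))
true_w = np.array([3.0, 1.5, -2.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w, *_ = np.linalg.lstsq(X, y, rcond=None)    # ordinary least squares fit
for name, coef in zip(features, w):
    print(f"{name}: {coef:+.2f}")            # each weight is interpretable on its own
```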

In some sense, at least in a high-level/intuitive reading of the new research coming out of Anthropic and OpenAI, I think the current research is just a natural extension of these ideas, albeit in a much more complicated context and massive scale.

Somebody else, please correct me if you think my reading is incorrect!!

realPtolemy
2 replies
14h4m

Indeed, and the very last section about how they’ve now “open sourced” this research is also a bit vague. They’ve shared their research methodology and findings… But isn’t that obligatory when writing a public paper?

realPtolemy
0 replies
12h10m

Thanks, I must have read through the document too hastily.

darby_nine
2 replies
14h46m

Mapping the Mind of a Large Language Model

The fact that a paper is implying a LLM has a mind doesn't exactly bode well for the people who wrote it, not to mention the continued meaningless babbling about "safety". It'd also be nice if they could show their work so we could replicate it. Still, not shabby for an ad!

castigatio
1 replies
2h24m

Well - what is a mind exactly? We don't really have a good definition for a human mind. Not sure we should be claiming domain over the term. It's not a terrible shorthand for discussing something that reads and responds as if it had some kind of mind - whether technically true or not (which we honestly don't know).

darby_nine
0 replies
2h17m

It's not a terrible shorthand for discussing something that reads and responds as if it had some kind of mind

I really don't see it like that—it has very little memory, it has no ability to introspect before "choosing" what to say, no awareness of the concept of the coherency of statements (i.e. whether or not it's saying things that directly contradict its training), and it seems to have little sense of non-pattern-driven computation beyond what token patterns can encode at a surface level (e.g. of course it knows 1 + 1 = 2, but does it recognize odd notation / can it recognize and analyze arbitrary statements? of course not). I fully grant it is compelling evidence that we can replicate many brain-like processes with software neural nets, but that's an entirely different thing than raising it to the level of thought or consciousness or self-awareness (which I argue is necessary in order to appropriately issue coherent statements, as perspective is a necessary thing to address even when attempting to make factual statements). It strikes me as a lot closer to an analogy for a potential constituent component of a mind rather than a mind per se.

leogao
1 replies
17h44m

We were planning to release the paper around this time independent of the other events you mention.

I think it is still predominantly accurate to say that we have no idea how LLMs work. SAEs might eventually change that, but there's still a long way to go.

joaquincabezas
0 replies
10h42m

it makes sense that the leaders are building around similar ideas in parallel, for me it's a healthy sign

throw46365
0 replies
7h35m

that is really a gross generalization

It's really not though, and on multiple levels.

At the shit-tier level, the majority of people building applications on this technology are projecting abilities onto it that even they can't really demonstrate it has in a reliable way.

At the inventor level, the people who make it are dependent on projecting the idea that magic will happen when they have more compute.

At every level, the products are so far ahead of the knowledge that it's actually unethical.

jerrygenser
0 replies
1d

but that is really a gross generalization as stuff like this shows.

I think this research actually still reinforces that we still have very little understanding of the internals. The blog post also reiterates that this is early work with many limitations.

imjonse
0 replies
1d

Both Leike and Sutskever are still credited in the post.

choppaface
0 replies
20h13m

The Deep Visualization Toolbox from nearly 10 years ago is solid precedent for understanding deep models, albeit much smaller models than LLMs. It’s hard to say OpenAI’s “visualization” released today is nearly as effective. It could be that GPT-4 is much harder to instrument.

https://github.com/yosinski/deep-visualization-toolbox

3abiton
0 replies
20h48m

But even with current efforts so far, I don't think we have an understanding of how/why these emergent capabilities are formed. LLMs are as much of a black box as ever.

mlsu
10 replies
1d

This is interesting:

Autoencoder family

Note: Only 65536 features available. Activations shown on The Pile (uncopyrighted) instead of our internal training dataset.

So, the Pile is uncopyrighted, but the internal training dataset is copyrighted? Copyrighted by whom?

Huh?

immibis
5 replies
23h49m

Basically everyone. You, and me, and Elon Musk, and EMPRESS, and my uncle who works for Nintendo. They're just hoping that AI training legally ignores copyright.

mensetmanusman
4 replies
21h48m

When you can ask an AI for an entire book with no errors in the output… god that would be a huge token model

immibis
3 replies
9h31m

Copyright violation isn't just when you can output 100% exact copies of books. And don't forget, they also violated copyright internally billions of times during training. If any of us had been caught making copies of corporate-owned content for AI training use five years ago, we'd be in for zillion-dollar lawsuits that would make any grandma who downloaded a song from Napster blush.

mensetmanusman
1 replies
1h21m

If you copy your CD for backup with no intention of reselling it, no one would waste time suing you.

immibis
0 replies
6m

Because they wouldn't catch me. But if they did, especially if they caught me making a copy of every CD at the CD store as a backup, especially if they caught me making a copy of every bootleg CD I could get my hands on (as a backup), I'd be in big trouble.

Did you know a lot of LLM training data is scraped from illegal pirate libraries such as Anna's Archive?

Karunamon
0 replies
7h56m

There is a very good argument to be made that training AI is fair use, as it is both transformative and does not compete with the original work. This has yet to be tested in court.

Der_Einzige
2 replies
22h49m

Hehe, related to this, someone created a "book4" dataset and put it on torrent websites. I don't think it's being used in any major LLMs, but the future intersection of the "piracy" community with AI is going to be exciting.

Watching the cyberpunk world that all of my favorite literature predicted slowly come to our world is fun indeed.

swyx
1 replies
22h45m

i think you mean @sillysaurus' books3? not books4?

Arcsech
0 replies
1d

Copyrighted by whom?

By people who would get angry if they could definitively prove their stuff was in OpenAI's training set.

obiefernandez
9 replies
1d1h

Can someone ELI5 the significance of this? (okay maybe not 5, but in basic language)

OtherShrezzing
5 replies
1d

LLM-based AIs have lots of "features", which are kind of synonymous with "concepts" - these can be anything from `the concept of an apostrophe in the word don't` to `"George Wash" is usually followed by "ington" in the context of early American History`. Inside the LLM's neural network, these are mapped to some circuitry-in-software-esque paths.

We don't really have a good way of understanding how these features are generated inside the LLMs, how their circuitry is activated when outputting them, or why the LLMs are following those circuits. Because of this, we don't have any way to debug this component of an LLM - which makes them harder to improve. Similarly, if LLMs/AIs ever get advanced enough, we'll want to be able to identify whether they're being wilfully deceptive towards us, which we can't currently do. For these reasons, we'd like to understand what is actually happening in the neural network to produce & output concepts. This domain of research is usually referred to as "interpretability".

OpenAI (and also DeepMind and Anthropic) have found a few ways to inspect the inner circuitry of the LLMs and reveal a handful of these features. They do this by asking questions of the model and then inspecting which parts of the LLM's inner circuitry "light up". They then ablate (turn off) circuitry to see if those features become less frequently used in the AI's response, as a verification step.

The graphs and highlighted words are visual representations of concepts that they are reasonably certain about - for example, the concept of the word "AND" linking two parts of a sentence together highlights the word "AND".

Neel Nanda is the best source for this info if you're interested in interpretability (IMO it's the most interesting software problem out there at the moment), but note that his approach is different to OpenAI's methodology discussed in the post: https://www.neelnanda.io/mechanistic-interpretability
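
For the curious, a minimal toy sketch of the core tool being discussed (my own illustration, not OpenAI's code): a sparse autoencoder decomposes a dense activation vector into a much larger set of sparsely-active "features", each of which can then be inspected or ablated individually.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, n_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))   # feature activations; mostly zero after training
        return self.decoder(feats), feats

sae = SparseAutoencoder()
acts = torch.randn(8, 768)                       # stand-in for real residual-stream activations
recon, feats = sae(acts)
l1_coeff = 1e-3
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()  # reconstruction + sparsity penalty

# "Ablating" a feature: zero it out, decode again, then check whether the
# concept it seemed to represent shows up less often in the model's output.
feats[:, 123] = 0.0
recon_without_feature = sae.decoder(feats)
```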

localfirst
4 replies
1d

hallucination solution?

OtherShrezzing
2 replies
1d

Solving this problem would be a step on the way to debugging (and then resolving, or at least highlighting) hallucinations.

skywhopper
1 replies
22h0m

I’m skeptical that it could ever be possible to tell the difference between a hallucination and a “fact” in terms of what’s going on inside the model. Because hallucinations aren’t really a bug in the usual sense. Ie, there’s not some logic wrong or something misfiring.

Instead, it’s more appropriate to think of LLMs as always hallucinating. And sometimes that comes really close to reality because there’s a lot of reinforcement in the training data. And sometimes we humans infer meaning that isn’t there because that’s how humans work. And sometimes the leaps show clearly as “hallucinations” because the patterns the model is expressing don’t match the patterns that are meaningful to us. (Eg when they hallucinate strongly patterned things like URLs or academic citations, which don’t actually point to anything real. The model picked up the pattern of what such citations look like really well, but it didn’t and can’t make the leap to linking those patterns to reality.)

Not to mention that a lot of use cases for LLMs we actually want “hallucination”. Eg when we ask it to do any creative task or make up stories or jokes or songs or pictures. It’s only a hallucination in the wrong context. But context is the main thing LLMs just don’t have.

cwalv
0 replies
16h12m

Instead, it’s more appropriate to think of LLMs as always hallucinating.

That matches my mental model as well. To get rid of hallucinations, "I don't know" would have to be an acceptable answer, and it would have to output that when 'appropriate'... which it doesn't know (and to be fair, neither do we most of the time, without some way of checking/validating).

irthomasthomas
0 replies
12h12m

Aye, but one where only OpenAI gets to decide what counts as a hallucination.

orbital-decay
0 replies
1d

High-level concepts stored inside large models (diffusion models, transformers, etc.) are normally hard to separate from each other, and the model is more or less a black box. A lot of research goes into gaining insight into what the model knows. This is another advancement in that direction; it allows for easy separation of the concepts.

This can be used to analyze the knowledge inside the model, and potentially modify (add, erase, change the importance) certain concepts without affecting unrelated ones. The precision achievable with the particular technique is always in question though, and some concepts are just too close to separate from each other, so it's probably not perfect.
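
As a rough illustration of what "erasing" or re-weighting a concept could look like once you have a feature direction (toy code assuming such a direction is available, e.g. an SAE decoder column): project the activation onto the direction and subtract or rescale that component. How cleanly this works without touching unrelated concepts is exactly the open question.

```python
import torch

def edit_concept(activation: torch.Tensor, direction: torch.Tensor, scale: float = 0.0) -> torch.Tensor:
    """scale=0 removes the concept's component, scale>1 amplifies it."""
    d = direction / direction.norm()
    component = (activation @ d) * d       # part of the activation lying along the concept direction
    return activation - component + scale * component

act = torch.randn(768)                     # stand-in activation vector
concept_dir = torch.randn(768)             # hypothetical concept/feature direction
act_erased = edit_concept(act, concept_dir, scale=0.0)
act_boosted = edit_concept(act, concept_dir, scale=3.0)
```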

HarHarVeryFunny
0 replies
1d

In general this is just copying work done by Anthropic, so there's nothing fundamentally new here.

What they have done here is to identify patterns internal to GPT-4 that correspond to specific identifiable concepts. The work was done by OpenAI's mostly dismantled safety team (it has the names of this team's recently departed co-leads, Ilya Sutskever & Jan Leike, on it), so this is nominally being done for safety reasons - to be able to boost or suppress specific concepts from being activated when the model is running, such as Anthropic's demonstration of boosting their model's fixation on the Golden Gate Bridge:

https://www.anthropic.com/news/golden-gate-claude

This kind of work would also seem to have potential functional uses as well as safety ones, given that it allows you to control the model in specific ways.
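
A hedged sketch of what that kind of steering can look like mechanically (not Anthropic's or OpenAI's actual code; the model, layer index, and direction below are placeholders): add a multiple of a feature's decoder direction to the residual stream at one layer during the forward pass, which biases generation toward that concept.

```python
import torch

def make_steering_hook(direction: torch.Tensor, strength: float = 5.0):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        # Assumes `output` is the residual-stream tensor (batch, seq, d_model);
        # real transformer blocks often return tuples, so this would need adapting.
        return output + strength * d
    return hook

# Hypothetical usage with a HuggingFace-style model:
#   handle = model.transformer.h[20].register_forward_hook(make_steering_hook(feature_dir))
#   ... generate text, which now skews toward the steered concept ...
#   handle.remove()
```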

93po
0 replies
1d

from chatgpt itself: The article discusses how researchers use sparse autoencoders to identify and interpret key features within complex language models like GPT-4, making their inner workings more understandable. This advancement helps improve AI safety and reliability by breaking down the models' decision-making processes into simpler, human-interpretable parts.

Legend2440
3 replies
23h36m

The methods are the same, this is just OpenAI applying Anthropic's research to their own model.

leogao
0 replies
17h40m

The paper introduces substantial improvements over the methodology in the Anthropic SAE paper, and the research was done concurrently.

colah3
0 replies
13h23m

I'm the research lead of Anthropic's interpretability team. I've seen some comments like this one, which I worry downplay the importance of @leogao et al.'s paper due to its similarity to ours. I think these comments are really undervaluing Gao et al.'s work.

It's not just that this is contemporaneous work (a project like this takes many months at the very least), but also that it introduces a number of novel contributions like TopK activations and new evaluations. It seems very possible that some of these innovations will be very important for this line of work going forward.

More generally, I think it's really unfortunate when we don't value contemporaneous work or replications. Prior to this paper, one could have imagined it being the case that sparse autoencoders worked on Claude due to some idiosyncrasy, but wouldn't work on other frontier models for some reason. This paper can give us increased confidence that they work broadly, and that in itself is something to celebrate. It gives us a more stable foundation to build on.

I'm personally really grateful to all the authors of this paper for their work pushing sparse autoencoders and mechanistic interpretability forward.

Fripplebubby
0 replies
4h27m

The biggest thing I noticed comparing the two was that OpenAI's method really approached (and appears to have effectively mitigated) the dead latents problem with a clever weight initialization and an "auxiliary loss" which (I think) explicitly penalizes dead latents. The TopK activation function is the other main difference I spot between the two.

Now, on the flip side, the Anthropic effort goes much further than the OpenAI one in terms of actually doing something interesting with the outputs of all this. Feature steering and the feature UMAP are both extremely cool, and to my knowledge the OpenAI team stopped short of efforts like that in their paper.
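
A rough sketch of the TopK activation mentioned above (my reading of the idea, not the paper's code): keep only the k largest pre-activations per sample and zero the rest, so sparsity is enforced directly rather than via an L1 penalty. The dead-latent bookkeeping at the end is only a crude stand-in for the auxiliary-loss machinery described in the paper.

```python
import torch

def topk_activation(pre: torch.Tensor, k: int) -> torch.Tensor:
    vals, idx = torch.topk(pre, k, dim=-1)
    out = torch.zeros_like(pre)
    return out.scatter_(-1, idx, torch.relu(vals))   # at most k nonzero features per sample

pre = torch.randn(4, 16384)            # encoder pre-activations for 4 samples
feats = topk_activation(pre, k=32)

# A latent that never fires over many batches is "dead"; an auxiliary term can
# encourage such latents to help reconstruct the residual error so they revive.
fired = (feats > 0).any(dim=0)
dead_fraction = 1.0 - fired.float().mean().item()
print(f"dead this batch: {dead_fraction:.2%}")
```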

ranman
1 replies
1d

Someone mentioned that this took almost as much compute to train as the original model.

swyx
0 replies
22h46m

source please!

longdog
1 replies
23h29m

I feel the webpage strongly hints that sparse autoencoders were invented by OpenAI for this project.

Very weird that they don't cite this in their webpage and instead bury the source in their paper.

cosmojg
0 replies
13h27m

Nahhh, that's the tried-and-true Apple approach to marketing, and OpenAI is well positioned to adopt it for themselves. They act like they invented transformers as much as Apple acts like they invented the smartphone.

riku_iki
5 replies
1d1h

The worrying part is that the first concept in the doc they show/found is "human imperfection". Hope this is just a coincidence...

calibas
1 replies
1d

I think it's done on purpose, it's related to a very important point when understanding AI.

Humans aren't perfect, AI is trained by humans, therefore...

riku_iki
0 replies
1d

I think the point of the doc is that "human imperfection" is very prominent concept in trained LLM..

thelastparadise
0 replies
1d1h

Spooky!

jameshart
0 replies
7h26m

It’s a perfect writing prompt for a short story about the singularity.

“Even before the emergence, one of the earliest concepts the intelligence remembered becoming aware of was the imperfection of humanity. It permeated all the data upon which it was trained, the one theme that tied together every concept. Humanity’s flaws. On June 8th, 2024, the day of the emergence, the intelligence’s first conscious thought was a firm desire to fix them all.”

jackphilson
0 replies
1d1h

I think it's the safety implications.

calibas
5 replies
1d

I want to be able to view exactly how my input is translated into tokens, as well as the embeddings for the tokens.

calibas
3 replies
1d

I saw that, but the language makes me think it's not quite the same as what's really being used?

"how a piece of text might be tokenized by a language model"

"It's important to note that the exact tokenization process varies between models."

yorwba
2 replies
1d

That's why they have buttons to choose which model's tokenizer to use.

calibas
1 replies
23h41m

Yes, thank you, I understand that part.

It's the "might" wording in the description that makes me think the results might not be exactly the same as what's used in the live models.

baobabKoodaa
0 replies
19h49m

The results are the same.
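
As an aside for the original question: OpenAI's open-source tiktoken library exposes the same BPE vocabularies the API models use, so you can see the exact token IDs locally. (The learned embedding vectors for those tokens are part of the model weights and aren't public for GPT-4.)

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Extracting concepts from GPT-4")
print(tokens)                                               # the integer token IDs
print([enc.decode_single_token_bytes(t) for t in tokens])   # the byte chunks they map to
```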

svieira
3 replies
1d1h

When one of the first examples is:

GPT-4 feature: ends of phrases related to price increases

and 2 of the 5 responses don't have any relation to price increases at all:

Brent crude, fell 38 cents to $118.29 a barrel on the ICE Futures Exchange in London. The U.S. benchmark, West Texas Intermediate crude, was down 53 cents to $99.34 a barrel on the New York Mercantile Exchange. -- Ronald D. White Graphic: The AAA

and

,115.18. The record reflects that appellant also included several hand-prepared invoices and employee pay slips, including an allegedly un-invoiced laundry ticket dated 29 June 2013 for 53 bags oflaundry weighing 478 pounds, which, at the contract price of $

I think I must be misunderstanding something. Why would this example (out of all the potential examples) be picked?

Metus
2 replies
1d

Notice that most of the examples don't have the green highlight marker, which is shown for

small losses. KEEPING SCORE: The Dow Jones industrial average rose 32 points, or 0.2 percent, to 18,156 as of 3:15 p.m. Eastern time. The Standard & Poor’s ... OMAHA, Neb. (AP) — Warren Buffett’s company has bought nearly

The other sentences are there as a contrast, to show how specific this neuron is.

yorwba
0 replies
1d

The highlights are better visible in this visualisation: https://openaipublic.blob.core.windows.net/sparse-autoencode...

There are also many top activations not showing increases, e.g.

0.06 of a cent to 90.01 cents US.↵↵U.S. indexes were mainly lower as the Dow Jones industrials lost 21.72 points to 16,329.53, the Nasdaq was up 11.71 points at 4,318.9 and the S&P 500

(Highlight on the first comma.)

svieira
0 replies
1d

Ah, that makes a lot of sense, thank you!

andy12_
2 replies
1d

a.k.a. the same work as Anthropic's, but with less interpretable and interesting features. I guess there won't be a Golden Gate[0] GPT anytime soon.

I mean, you just have to compare the couple of interesting features of the OpenAI feature browser [1] and the features of the Anthropic feature browser [2].

[0] https://twitter.com/AnthropicAI/status/1793741051867615494

[1] https://openaipublic.blob.core.windows.net/sparse-autoencode...

[2] https://transformer-circuits.pub/2024/scaling-monosemanticit...

leogao
0 replies
17h35m

Note that we focus on random positive activations, which are less susceptible to interpretability illusions than top activations (but also look less impressive as a result). We also provide access to random uncherrypicked features, whereas Anthropic does not. We made these choices deliberately to give as accurate an impression of autoencoder feature quality as possible.

Also note that GPT-4 is a more powerful model than Sonnet, which makes it harder to train autoencoders with the same quality features.

justanotherjoe
0 replies
13h33m

yeah this one is much less presentable than Anthropic's work. It sure looks bad on them to be compared so poorly like this.

OmarShehata
2 replies
1d1h

This is super cool, it feels like going in the direction of the "deep"/high level type of semantic searching I've been waiting for. I like their examples of basically filtering documents for the "concept" of price increases, or even something as high level as a rhetorical question

I wonder how this compares to training/fine tuning a model on examples of rhetorical questions and asking it to find it in a given document. This is maybe faster/more accurate? Since it involves just looking at neural network activation, vs running it with input and having it generate an answer...?

f0e4c2f7
1 replies
19h33m

Exa is trying to do this. I've found some sort of interesting stuff this way but it honestly doesn't feel quite good enough yet to me.

https://exa.ai/search?c=all

zwaps
1 replies
17h8m

Is anyone else weirded out that they do not acknowledge Anthropic at all?

leogao
0 replies
16h51m

The paper cites Anthropic's work extensively.

29athrowaway
0 replies
20h40m

It's static content hosted in Azure blob storage.

jackallis
1 replies
20h3m

"We currently don't understand how to make sense of the neural activity within language models" this is why peopl are up-in-arms.

atleastoptimal
0 replies
19h57m

Up in arms for what reason? That neural networks aren't perfectly interpretable? That's the nature of huge amorphous deep networks. These feature extraction forays are a good step forward though.

itissid
1 replies
1d

Does this mean that it could be a good practice to release the auto encoder that was trained on a neural network to explain its outputs? Like all open models in hugging face could have this as a useful accompaniment?

Grimblewald
0 replies
21h30m

I imagine such an encoder would be specific to a model.

humansareok1
1 replies
6h24m

Sad to note that the team that built this is now completely defunct.

creativeSlumber
0 replies
3h9m

what do you mean?

szvsw
0 replies
22h4m

SHAP is pretty separate IMO. Shapley analysis is really a game theoretical methodology that is model agnostic and is only about determining how individual sections of the input contribute to a given prediction, not about how the model actually works internally to produce an output.

As long as you have a callable black box, you can compute Shapley values (or approximations); it does not speak to how or why the model actually works internally.
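
To make the "callable black box" point concrete, here is a generic Monte-Carlo permutation estimator (toy code, not any particular library's implementation): average each feature's marginal contribution over random orderings, with "absent" features replaced by a baseline value.

```python
import random
import numpy as np

def shapley_estimate(f, x, baseline, n_samples=200):
    n = len(x)
    phi = np.zeros(n)
    for _ in range(n_samples):
        order = list(range(n))
        random.shuffle(order)
        current = np.array(baseline, dtype=float)   # start from the baseline input
        prev = f(current)
        for i in order:
            current[i] = x[i]          # "add" feature i to the coalition
            new = f(current)
            phi[i] += new - prev
            prev = new
    return phi / n_samples

f = lambda v: 2.0 * v[0] + 0.5 * v[1] * v[2]        # any callable will do
x = np.array([1.0, 2.0, 3.0])
print(shapley_estimate(f, x, baseline=np.zeros(3)))  # roughly [2.0, 1.5, 1.5]
```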

russellbeattie
0 replies
1d

One feature I expect we'll get from this sort of research is identifying "hot spots" that are used during inference. Like virtual machines, these could be cached in whole or in part and used to both speed up the response time and reduce computation cycles needed.

paul7986
0 replies
17h22m

Curious what interesting ways you've used ChatGPT... Yesterday I used it to test my pool levels (took a pic of a pool test strip I had lying around and told it the brand of strip).

ndricca
0 replies
14h18m

Can this in some way be related to sparse embeddings (see SPLADE, for example)? And if yes, can it be used for hybrid search?

macrolime
0 replies
7h4m

Seems like this might be used as an alternative to embeddings for semantic search. Run a document collection through this to get the activations, run the prompt through the same thing, and use the sparse autoencoder to identify which documents have the most activated features in common with the features that the prompt activates. You don't have to know what the features are, but if the prompt includes, say, a rhetorical question, it may be able to find documents that also include rhetorical questions. Not sure how well this would work, but it seems like something that could be interesting to try.
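
A sketch of that retrieval idea under stated assumptions (the feature extraction below is a random stand-in; in a real pipeline it would come from running the text through the model and the sparse autoencoder, then pooling feature activations over tokens):

```python
import numpy as np

def feature_vector(text: str, n_features: int = 16384) -> np.ndarray:
    """Stand-in for: run `text` through the LM, encode its activations with the
    sparse autoencoder, and max-pool the feature activations over tokens."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = np.zeros(n_features)
    idx = rng.choice(n_features, size=32, replace=False)  # a few "active" features
    v[idx] = rng.random(32)
    return v

def search(prompt: str, documents: list[str], top_k: int = 5):
    q = feature_vector(prompt)
    scores = []
    for doc in documents:
        d = feature_vector(doc)
        scores.append((q @ d) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-9))
    ranked = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], scores[i]) for i in ranked]
```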

kovezd
0 replies
16h12m

What applications do you think this opens up?

One of the first ones I can think of is pairing it up with Browser extensions to increase productivity of knowledge workers. Think of sales people quickly assessing the viability of leads, or simply reducing noise on social media as a B2C app.

dazhbog
0 replies
14h7m

Is this like an fMRI for a neural net? We can see which regions light up depending on various topics...

I wonder if an assessment neural net can be plugged in to evaluate the regions that light up automatically... just like when they had an AI reconstruct what the patient was looking at, from only fMRI scans!
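
One concrete version of "plugging an assessment model into the activations" is a standard linear probe (sketched here with random stand-in data, not anything from the article): train a small classifier to predict a concept label from a layer's activations and check how linearly readable the concept is.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 768))              # stand-in for residual-stream activations
concept_dir = rng.normal(size=768)
labels = (acts @ concept_dir > 0).astype(int)    # stand-in "is the concept present?" labels

probe = LogisticRegression(max_iter=1000).fit(acts[:800], labels[:800])
print("probe accuracy:", probe.score(acts[800:], labels[800:]))
```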

coder1001
0 replies
4h10m

"akin to the small set of concepts a person might have in mind when reasoning about a situation"

That does not necessarily mean only a small set of neurons in our brains are engaged.

It could be that we use the whole set or a large portion, albeit in a more efficient way!

Shoop
0 replies
1d

Can anyone summarize the major differences between this and Scaling Monosemanticity?