This is pretty remarkable. So these models really do learn human-interpretable representations, rather than only doing some magic in a billion-dimensional space that we can't hope to decipher.
This is awesome!
So one of the big reasons there was hype about Sora is that it felt very likely from watching a few videos that there was an internal physical simulation of the world happening and the video was more like a camera recording that physical and 3D scene simulation. It was just a sort of naïve sense that there HAD to be more going on behind the scenes than gluing bits of other videos together.
This is evidence, and it’s appearing even in still image generators. The models essentially learn how to render a 3D scene and take a picture of it. That’s incredible considering that we weren’t trying to create a 3D engine, we just threw a bunch of images at some linear algebra and optimized. Out popped a world simulator.
just a sort of naïve sense that there HAD to be more going on behind the scenes than gluing bits of other videos together
People really still think that's all that's happening?
There's a lot of nonsensical babble out there from people who have minimal technical insight - both on the sceptic and on the enthusiast side of AI.
It really wasn't clear for the longest time how these models generate things so well. Articles like this one are still rare and comparatively new. And they certainly weren't around when less informed enthusiasts were already heralding AGI.
On the other hand, we have accrued quite a bit of evidence by now that these models do far more than glue together training data. But there are still sceptics out there who spread this sort of misinformation.
Citations for that last claim?
This post we're commenting on. Also https://www.neelnanda.io/mechanistic-interpretability/othell...
Also, there's a paper I can't find now which shows you can find the name of the operation in the middle layers if you provide an example of what to do. That is, if you prompt: "9 2 5 -> 5 2 9, 0 7 3 ->", you'll find the internals generate "reverse" even though it doesn't appear in the output.
So you can programmatically approximate the scene intrinsics in this paper with traditional programs as well given a single image.
All of this scene data is already suggested by the data; what the AI model is doing is approximating the data it needs to generate a scene.
It’s not just gluing data together, it’s discovering interrelationships in a very high dimensional feature space. It’s not creating anything it hasn’t seen either, which is why many image models (mostly smaller ones) have so much trouble with making fine details make sense, but are good with universal patterns like physics, lighting, the human shape, and composition.
(not sure if you were trying to imply that, but it felt like it)
If you try it with a longer sequence of numbers it seems to stumble. E.g. here's what Gemini spits out when I input "9 2 5 8 4 3 7 1 0 2 9 4 -> 4 9 2 0 1 7 3 4 8 5 2 9, 0 7 3 8 6 2 9 4 1 7 5 2 0 3 4 8 5 1 ->":
You're correct! The pattern you identified applies to both sets of numbers. Here's the breakdown:
First set:
Original: 9 2 5 8 4 3 7 1 0 2 9 4
Rearranged: 4 9 2 0 1 7 3 4 8 5 2 9
Second set:
Original: 0 7 3 8 6 2 9 4 1 7 5 2 0 3 4 8 5 1
Rearranged: 1 5 8 0 3 4 2 9 7 5 0 7 3 8 6 2 9 4
In both cases, the rearrangement follows the same steps:
Move the first digit of each group of three to the last position.
Keep the middle digit in the same position.
Repeat steps 1 and 2 for all groups of three.
Therefore, the pattern holds true for both sets of numbers you provided.
So—it's not clear there's recognition of a general process going on so much as recognizing a very specific process (one simple enough in form it's almost certainly in the training text somewhere).
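For what it's worth, this is easy to check mechanically. A minimal Python sketch (the helper name `reverse_tokens` is made up for illustration) that compares the Gemini answer quoted above against a plain reversal of the second input:

```python
def reverse_tokens(seq: str) -> str:
    """Reverse a space-separated sequence of tokens."""
    return " ".join(reversed(seq.split()))

# Few-shot example from the prompt: this pair really is a reversal.
example_in = "9 2 5 8 4 3 7 1 0 2 9 4"
example_out = "4 9 2 0 1 7 3 4 8 5 2 9"

# The second input, and what Gemini answered for it (copied from above).
query_in = "0 7 3 8 6 2 9 4 1 7 5 2 0 3 4 8 5 1"
gemini_out = "1 5 8 0 3 4 2 9 7 5 0 7 3 8 6 2 9 4"

example_is_reversal = reverse_tokens(example_in) == example_out
gemini_is_reversal = reverse_tokens(query_in) == gemini_out

print("example is a reversal:", example_is_reversal)
print("Gemini answer is a reversal:", gemini_is_reversal)
print("a plain reversal would be:", reverse_tokens(query_in))
```

The prompt's own example passes the check and the Gemini answer does not, which fits the "stumbles on longer sequences" observation. It says nothing about what the internals represent, only about the output.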
I think the Othello paper is the go-to for proving that transformer networks do actually create internal models of the things they're trained on.
It really wasn't clear for the longest time how these models generate things so well.
Honestly, I think it was pretty much just as clear in 2021 as it is in 2024. Whether you consider that 'clear' or 'not clear' is a matter of personal choice but I don't think we've really advanced our understanding all that far (mechanistic interpretability does not tell us that much, grokking is a phenomenon that does not apply to modern LLMs, etc.)
we have accrued quite a bit of evidence by now that these models do far more than glue together training data. But there are still sceptics out there who spread this sort of misinformation.
Few people who actually worked in this field and were familiar with the concept of 'implicit regularization' really ever thought the 'glue together training data' or 'stochastic parrot' explanations were very compelling at all.
Making Sora's results by glueing videos together would be far more impressive IMO. That would take ASI
Who is people? Most folks only vaguely know about generative AI if they know about it at all. I'm technical but not in software specifically and usually ignore most AI news, so GP's comment is in fact news to me!
Like the cat springing a 5th leg and then losing it just as quick, in a cherry-picked video from the software makers? How does that fit your wishful narrative?
Look at how anyone who's not an artist (especially children) draws pretty much anything, whether it's people or animals or bicycles. You'll see a lot of asymmetric faces, wrong shading, completely unrealistic hairs etc and yet all those people are intelligent and excellent at understanding how the world works. Intelligence can't be expressed in a directly measurable way.
Those videos may be cherry-picked but they're almost good enough that you have to pick problems to point out, problems that will likely be gone a couple iterations later.
Less than a decade ago we read articles about Google's weird AI model producing super trippy images of dogs, now we're at deepfakes that successfully scam companies out of millions and AI models that can't yet reliably preserve consistency across an entire image. In another ten years every new desktop, laptop and smartphone will have an integrated AI accelerator and most of those issues will be fixed.
Anthropomorphizing models lets assumptions of human behavior slip into discussions. Most people don't even know how their own brains process information, and models are not children.
New models appear to be close and, based on arguments here, closing the gap is simply a matter of time. This is based on observing the rate of progress, not the actual underlying actions. A tendency belied by assumptions based on human thinking, not the significant amounts of processing that happens unconsciously by us.
Sure expanding the data set has improved results, yet bad hands, ghost legs, and other weirdness persist. If you have a world model, then this shouldn't happen - there is a logical rule to follow, not simply a pixel level correlation.
Working from the other side, if image/video gen has reached the point that it is faithfully recreating 3D rules, then we should expect 3D wireframes to be generated. We should see an update to https://github.com/openai/point-e.
This isn't splitting hairs - this behavior can be hand waved away when building a PoC, not when you make a production ready product.
As for scams, they target weaknesses, not the strongest parts of a verification process, making them strawman arguments for AI capability.
If you have a world model, then this shouldn't happen - there is a logical rule to follow, not simply a pixel level correlation.
Oh I guess humans don't have world models then. It's so weird seeing this rhetoric said again and again. No a world model doesn't mean a perfect one. A world model doesn't mean a logical one either. Humans clearly don't work by logic.
If we are getting into the semantics of what a valid world model is, then we really have different standards for what passes as logic.
By your standards, you don't have a world model either. Maybe you agree with that though.
And logic as in formal logic. Human Intelligence clearly doesn't run on that.
Well we do have game engines for a while now. Maybe a game engine with a chat interface would do a better job
Depends on whether your goal is to render a scene or push the envelope in generative AI.
I think is the key, most likely it was a big part of their training dataset.
We humans live in a 3D world and our training set is a continuous stereo stream of a constant scene from different angles. Sora, on the other hand, learned the world by watching TV. It needs to play more video games, in order to learn 3D scenes (implicit representation of a world) and taking their pictures (rendering). Maybe that was the case I don’t know.
It needs to play more video games, in order to learn 3D scenes (implicit representation of a world) and taking their pictures (rendering).
This can be extrapolated from TV, too.
we weren’t trying to create a 3D engine, we just threw a bunch of images at some linear algebra and optimized. Out popped a world simulator.
Heh. Sounds like what a personified evolution might say about a mind ;)
NNs are _not_ linear algebra. The genius of NNs is that they are a half-linear algebra (assuming most of them these days use ReLU activation). And that half-linearity gives power.
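To make the "half-linear" point concrete, here is a minimal sketch in plain Python (the helper names are mine, not from any library). A linear map f has to satisfy f(x + y) = f(x) + f(y). ReLU fails that, although it still commutes with scaling by a positive constant:

```python
def relu(v):
    """Elementwise ReLU: max(0, x) for each component."""
    return [max(0.0, x) for x in v]

def add(u, v):
    """Elementwise vector addition."""
    return [a + b for a, b in zip(u, v)]

x = [1.0, -2.0]
y = [-3.0, 1.0]

lhs = relu(add(x, y))        # relu(x + y)
rhs = add(relu(x), relu(y))  # relu(x) + relu(y)
print(lhs, rhs)              # these differ, so ReLU is not additive

# Scaling by a positive constant does pass through ReLU unchanged.
scaled_then_relu = relu([2.0 * a for a in x])
relu_then_scaled = [2.0 * a for a in relu(x)]
print(scaled_then_relu == relu_then_scaled)
```

Without that failure of additivity, a stack of layers would collapse into a single matrix multiply.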
Turns out that 3D graphics also involves a lot of linear algebra.
That's also why GPUs are so effective at neural networks. There is a lot of the same maths.
The name is a reference to a fictional gameshow in Bojack Horseman called "Hollywoo Stars and Celebrities: What Do They Know? Do They Know Things?? Let's Find Out!"
https://bojackhorseman.fandom.com/wiki/Hollywoo_Stars_and_Ce...!
I adore that show so much. I have this sticker on my laptop.
If you haven't seen Bojack Horseman, it's funny and heartfelt. Palpably existential. If that kind of thing speaks to you, you owe yourself a watch.
In terms of a complete animation package, I think it easily outdoes Futurama. There's so much relatable depth to it. It hits hard, but it stays lighthearted enough to make you feel good about it.
As it turns out, I'm now working on filmtech, so that "Hollywoo" sticker fits me even better now.
I want to like Bojack and I know it has drama and heartfelt moments (watched all of season 1 I think), but in my opinion it's undone by its moments of "wacky" humor. I don't even mean that there are humanoid animals and humans living together, no explanation -- I can embrace that. I mean the Simpsons/Family Guy kind of humor... I could do without it and I think it would make the show better.
Or maybe it does get better after season 1? Since everyone seems to love it.
A tear shed for Simpsons being called wacky humor.
I think Bojack hit its stride over time, but I always liked the juxtaposition of dark reality with cartoon silliness. On some level I think it reflects a theme of everyone else’s life seeming to be so simple and stupid and trivial compared to your own. Of course even Todd’s problems are shown to be pretty real even if they’re also cartoonishly comical.
A tear shed for Simpsons being called wacky humor
You know what I meant, hopefully? I predate the Simpsons, and I remember when they and their brand of humor first appeared. I also remember how annoyed I felt towards Family Guy (a cartoon I never liked, unlike the Simpsons) because it felt like a cheap copy. And yes, over the years the Simpsons turned wackier and wackier while retaining nothing of what made them good at first.
I don't claim Bojack and the Simpsons have the same kind of humor, it was a shortcut to hopefully explain what I found jarring about the former?
Given the rest of the comments, I'll try giving Bojack season 2 a chance.
Season 1 is definitely the weakest season, especially at the start. There's definitely still wacky humour in the rest of the show, but it loses the Family Guy cutaway gags.
I'd personally recommend giving season 2 a shot; if you still don't like it then the show's probably not for you.
Season 1 is definitely weakest; very common in great TV series. Just start from season 2 right now and I'm pretty sure you'll love it. It will always have "wacky" moments with a character like Todd in it though.
I remember reading they had to make season 1 "wacky" to sell the show. What the show becomes later would have been ... hard to describe to studio executives.
I’m at the other end, I think the show suffered due to its insistence on querulous characters as a means to pathos, but was kept afloat by its wit and humor (much of it indeed “wacky”). That is of course very much a personal preference thing, but I couldn’t connect with much of the darker side of the show, even while I admired much of its execution.
As others have said though, it definitely gets better as the seasons roll on, along both the humor and drama fronts. The final season has many of the funniest and most poignant moments of the show.
It’s very Los Angeles nihilist in tone. If that doesn’t make you happy, maybe skip it.
It hits hard, but it stays lighthearted enough to make you feel good about it.
That wasn't my experience of it. It's a brilliant show, to be sure, but at some point I felt the bleakness, particularly around Bojack's inability to help himself, became too much for me. Maybe different parts of the show resonate in different people.
I agree it wasn’t light at all in the later parts. Very bleak. If it was ever supposed to be a comedy it didn’t end as one but I can’t think of other animated series you could class as a drama.
I think it is one of only about three animated shows that are genuinely outstanding as TV and not just animation or comedy. The other two being Simpsons and Futurama.
I couldn't finish watching the show, it was just too depressing and dark for me to the extent that it made me personally more anxious and depressed.
I can’t think of other animated series you could class as a drama.
Honestly, there are tons. They’re just all Japanese.
Just a warning to anyone who read this comment, the show (in later seasons) gets incredibly dark, existential, and depressing, to the point where I couldn't watch it anymore as it made me both anxious and depressed.
Anyone prone to anxiety or depression or sad thoughts in general should probably avoid, it's really that depressing, and I can imagine it making suicidal people even more suicidal.
I would echo this, and reiterate not to underestimate the impact of watching this if you’re anxious or depressed. I was in a pretty rough place for the better part of a year after watching a few particularly dark episodes.
but it stays lighthearted enough to make you feel good about it.
There were plenty of parts that were dark enough to make it difficult to watch. There were lighthearted parts, but that's not a description I would've applied to the show as a whole.
I reference this specific game show title quite a bit, but I don't think that many people get it sadly, so I just seem weird haha
It’s for the best. We wouldn’t want this getting too commercial.
Not gonna lie I upvoted this post based only on the title.
I have found my people. I watched this show like 6 times haha
So how do they get the normals, for example? Are they generated by the AI before the image is generated in order to generate the image, and they are just reading them off some internal state?
Yes, there’s rudimentary evidence that there’s essentially a 3D engine within the model that participates in generating the image. If we could inspect and interpret the whole process it would likely be bizarre and byzantine, like a sort of convergent evolution that independently recreates a tangled spaghetti mess of Unreal Engine, Adobe Lightroom, and a physical simulation of a Canon D5.
Essentially similar perhaps to the 3D engine that a human brain runs that generates a single "3D" image from two 2D cameras (eyes) and fills in missing objects in blind spots, etc.
Note that while having two eyes helps build a more accurate 3D image, people with one eye still see in 3D. Eye movement is at least as important a part of 3D vision as stereoscopy.
And apparently 3D renders can be inferred from only 2D images like in these image gen models, so even without video or parallax, brains could probably model the world in 3D.
I remember a Wittgenstein thing asking to think of a tree or something, and then point to the place in your head where that thought exists. It's kind of like that.
The training data is just billions of {RGB image, text description} pairs. So, it appears the model figured out how to make the normals as part of the process of making the images.
Or, are you asking how the researchers extracted it?
I-LoRA modulates key feature maps to extract intrinsic scene properties such as normals, depth, albedo, and shading, using the models' existing decoders without additional layers, revealing their deep understanding of scene intrinsics.
Yes, that sentence "modulates key feature maps" doesn't tell me anything. What do they mean when saying they extract the normals?
I have no idea what Toyota or Adobe are up to and why they’re funding research with a name like this, but I fucking love it. It’s science, let’s get some whimsy back in here!!
More materially:
  Optimized with a small set of labeled images, our model-agnostic approach adapts to various generative architectures, including Diffusion models, GANs, and Autoregressive models.
The Bojack Horseman reference that we didn't know we needed.
What is this, a crossover episode?
You have no idea why Toyota or Adobe are funding computer vision research?
lol yeah a little. A) it said something very odd sounding like “Toyota university of Chicago”, which wtf why does Toyota have a university, and B) most labs would be hesitant to publish a paper with an extraneous clause in the title just to reference an absurd cartoon
There's research into editing facts in language models. https://rome.baulab.info/
Reminds me of when I tried to extract G-buffers from my Unity High Definition Rendering Pipeline test project: https://www.youtube.com/watch?v=Fwtc694qNUM
I'm not sure if this paper is really proving anything though. There's a giant-ass UNet LoRA model that's being trained here, so is it really "extracting" something from an existing model, or simply creating a new model that can create channels that look like something you'd get out of a deferred rendering pipeline?
After all, taking normals, albedo, and depth and combining them (deferred rendering) is just one of several techniques to create a 3D scene. It wasn't even used in videogames until the early 2000s (in a Shrek videogame for the Xbox! https://sites.google.com/site/richgel99/the-early-history-of...)
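For anyone who hasn't seen deferred rendering, a rough sketch of the "combine the channels" step in plain Python. This is a toy under stated assumptions: a single directional light, Lambertian diffuse shading only, and a made-up two-pixel G-buffer. The `depth` channel is carried along but unused here; a real pipeline would use it to reconstruct world position for point lights, fog and so on.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def shade_pixel(albedo, normal, light_dir):
    """Lambertian diffuse term: albedo * max(0, n . l)."""
    n = normalize(normal)
    l = normalize(light_dir)
    n_dot_l = max(0.0, sum(a * b for a, b in zip(n, l)))
    return tuple(c * n_dot_l for c in albedo)

# Hypothetical G-buffer: one record per pixel.
gbuffer = [
    {"albedo": (1.0, 0.5, 0.2), "normal": (0.0, 0.0, 1.0), "depth": 2.0},
    {"albedo": (0.2, 0.8, 0.2), "normal": (1.0, 0.0, 0.0), "depth": 3.5},
]
light_dir = (0.0, 0.0, 1.0)  # direction from the surface toward the light

# The lighting pass reads only the G-buffer, never the scene geometry.
image = [shade_pixel(p["albedo"], p["normal"], light_dir) for p in gbuffer]
print(image)
```

The first pixel faces the light and keeps its albedo; the second is edge-on and comes out black.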
What would really be awesome is to get a LoRA model that can extract the "camera" rotation and translation matrix for these image generation models. That would really demonstrate something (and be quite useful at the same time).
I don't really know what I'm talking about, but doesn't this address that?
with newly learned parameters that make up less than 0.6% of the total parameters in the generative model
0.6% sounds like a small number. Is it measuring the right thing?
Certainly, I wouldn't expect the model to necessarily be encoding exactly the set of things that they're extracting, but it still seems very significant to me even if it is "just" encoding some set of things that can be cheaply (in terms of model size) and reliably mapped to normals, albedo, and depth.
(I don't care what basis vectors it's using, as long as I know how to map them to mine.)
0.6 percent of a model with 890 million parameters is still 5.34 million parameters.
That's still pretty big. Maybe big enough to fake normals, learn some albedo smoothing functions, and learn a depth estimator perhaps??
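The arithmetic checks out. A quick sketch using the figures quoted in this thread (the paper says "less than 0.6%", so treat the result as an upper bound):

```python
base_params = 890_000_000  # model size quoted in the comment above

# 0.6% written as an exact integer ratio (6 / 1000) to avoid float rounding.
lora_params = base_params * 6 // 1000

print(f"{lora_params:,}")  # 5,340,000
```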
If you look at the supplementary material, they do a test where they train the LoRA with a randomly initialised UNet and it is largely incapable of extracting any surface normals, as opposed to using the pretrained Stable Diffusion UNet, clearly showing the features of the model are relevant to its performance.
Not to be a skeptic or anything, but how do we know normal maps etc weren’t enriched into the datasets by the image gen companies?
I understand this paper links to open-source models where that can be verified, but maybe this is one secret sauce of these more advanced models?
You would need to train on pairings of normal map images to source images. To my knowledge, that’s not a common training technique and this ability seems to bridge across several open models.
It’s also kind of cool if it “understands” a normal map and that it maps to specific 3D geometry. People generally use 3D tools like Zbrush to sculpt the details to normals. Probably some people can draw them too
Ooh, so this can take real images and predict albedo and lighting! Please, someone use this to make relightable gaussian splatting scenes. Dynamic lighting would really expand the usefulness of 3D scans made from photos, and I haven't seen anyone get anything close to what I would call "good" results in that area yet.
Can it definitely use real images? I imagine extracting the depth map from real images would be the most useful application if it can.
"Fortunately, diffusion models are not only powerful image generators, their structure as image-to-image models makes it straightforward to apply to real images."
It looks like they actually only tried depth/normal maps for real images, not albedo/lighting maps, but it certainly seems possible. Honestly depth maps often look impressive at first glance but typically if you actually try to use them for anything (e.g. reprojection, DoF blur) their hidden flaws instantly become super apparent, and they aren't as useful as you might think. Gaussian splatting reconstructions already do reprojection better anyway. OTOH albedo maps look weird, but relighting is extremely useful and important. I'm not excited about yet another way to generate approximate depth maps, but I am excited about relighting.
I skimmed the paper, but a lot of it was over my head. As someone not very versed in image generation AI, can anyone help me understand? This sentence (which another commenter highlighted) appears to be the key part:
I-LoRA modulates key feature maps to extract intrinsic scene properties such as normals, depth, albedo, and shading, using the models' existing decoders without additional layers, revealing their deep understanding of scene intrinsics.
What exactly does "modulates key feature maps to extract intrinsic scene properties" mean? How were these scene property images generated if no additional decoding layers were added?
Say you have a neural network with 1B parameters. You add 5M more parameters spread around (LoRA) and continue training for a while, but you update only the newly added parameters, not the base network. The result is a "modulated" network that predicts scene properties.
The interesting thing is that it only takes a few more parameters so it must be that the original network was pretty close already.
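A minimal sketch of that idea for a single layer, in plain Python. This is generic LoRA as usually described (frozen weight `W` plus a trainable low-rank product `B A`), not the paper's actual code; the sizes and rank are arbitrary values picked for illustration.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x):
    """y = W x + B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]

random.seed(0)
d_in, d_out, rank = 64, 64, 2  # arbitrary toy sizes

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
B = [[0.0] * rank for _ in range(d_out)]                                 # trainable, zero at init

x = [1.0] * d_in
y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x)

# With B at zero the adapter changes nothing, so training starts from the base model.
print(y_lora == y_base)

base_count = d_in * d_out
lora_count = rank * (d_in + d_out)
print(base_count, lora_count, f"{lora_count / base_count:.2%}")
```

The added parameter count grows with `rank * (d_in + d_out)` while the base layer grows with `d_in * d_out`, which is why the adapter can stay a tiny fraction of a large model.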
Great explanation! Wish I could upvote more as it took me from head scratching to “that makes total sense”.
SSL error? Just me?
Works on my machine. Alternative link: https://github.com/duxiaodan/intrinsic-lora
Works here. "Verified by DigiCert". SHA-256 fingerprint of 38:2C:D4:2D:33:C0:2B:C6:67:8E:65:7C:E1:7B:84:6D:04:73:A7:E7:91:CD:B3:5B:8E:AD:90:1A:F1:E1:1A:08
So is this GPT for images? They take a generative model, apply finetuning via LoRA on some downstream task such as surface normal estimation, conclude that these models intrinsically learn these representations, and find that they do better than supervised approaches?
I think this is awesome but maybe it’s not really that surprising given how this “generate and then finetune” approach has already worked so well?
Hmm. Facebook has had an option to create "3D photos" for a while. I think they're annoying but now I wonder how they're made, and if you really need a neural network to figure out the 3D part in a 2D photo.
Damn, now I'm suddenly finding myself wanting to go back and re-watch Bojack Horseman. Not that that would be a bad thing.
Yet again. Anyone who placed their bets on LiDAR for self driving is proven wrong. Cameras and neural nets will rule the day.
This is good news for VR (or spatial computing). If the models understand the physical world as well as the paper shows, generating two projections of a scene does not sound like a difficult ask. Really excited for what's to come.
I asked the AI that was posted here yesterday to draw a common European chub, and virtually every single one had an adipose fin.
I wonder if you could do this on the NN-generating NN from https://news.ycombinator.com/item?id=39458363.
Then you could have it start to create a vocabulary and an understanding around what all these parameters and weights do, since to us they're just a bunch of seemingly random floating point numbers.
At which point we may be able to start creating NNs from first principles, without even requiring training.
Would be interesting to see if the perceptive abilities of generative models are superior to human perception, when tested on optical illusions that humans are fooled by. E.g., do they correctly assess depth in a Ponzo illusion scenario?
Do we know the UNet was never trained on G-buffers?
I find it amazing that, for all the evidence we have of generative models having some fairly complex internal model of the world, people still insist that they are mere "stochastic parrots" who "don't really understand anything".
We have zero evidence that these models have any “understanding” of anything.
“Understanding” would mean that they are able to train themselves, which they are as yet unable to do.
We are tuning weights and biases to statistically regurgitate training data. That is all they are.
So, in order for a kid to understand multiplication means they must be able to train themselves and not regurgitate the multiplication table?
Yes, I think so. Understanding would involve understanding that it's a short form for adding. Remembering multiplication tables allows you to use math but it doesn't imply understanding.
Training themselves is necessary. All learning is self learning. Teachers can present material in different ways but learning is personal. No one can force you to learn either.
You can check if a model (or a kid) understands multiplication by simply probing them with questions, to explain the concept in their own words, or on concrete examples. If their answers are robust they understand, if their answers are fragile and very input dependent, they don't.
"To understand" is one of those poorly defined concepts, like "consciousness", it is thrown a lot in the face when talking about AI. But what does it mean actually? It means to have a working model of the thing you are understanding, a causal model that adapts to any new configuration of the inputs reliably. Or in other words it means to generalize well around that topic.
The opposite would be to "learn to the test" or "overfit the problem" and only be able to solve very limited cases that follow the training pattern closely. That would make for brittle learning, at surface level, based on shortcuts.
The weasel word here is "reliably". What does this actually mean? It obviously cannot be reliable in a sense of always giving the correct result, because this would make understanding something a strict binary, and we definitely don't treat it like that for humans - we say things like "they understand it better than me" all the time, which when you boil it down has to mean "their model of it is more predictive than mine".
But then if that is a quantifiable measure, then we're really talking about "reliable enough". And then the questions are: 1) where do you draw that line, exactly, and 2) even more importantly, why do you draw the line there and not somewhere else.
For me, the only sensible answer to this is to refuse to draw the line at all, and just embrace the fact that understanding is a spectrum. But then it doesn't even make sense to ask questions like "does the model really understand?" - they are meaningless.
(The same goes for concepts like "consciousness" or "intelligence", by the way.)
The reason why I think this isn't universally accepted is because it makes us not special, and humans really, really like to think of themselves as special (just look at our religions).
Our capacity to make mistakes does not necessarily equate to a lack of understanding.
If you’re doing a difficult math problem and get it wrong, that doesn’t necessarily imply that you don’t understand the problem.
It speaks to a limitation of our problem solving machinery and the implements we use to carry out tasks.
e.g. if I’m not paying close enough attention and write down the wrong digit in the middle of solving a problem, that could also just be because I got distracted, or made a mistake. If I did the same problem again from scratch, I would probably get it right if I understand the subject matter.
Limitations of our working memory, how distracted we are that day, mis-keying something on a calculator or writing down the wrong digit, etc. can all lead to a wrong answer.
This is distinct from encountering a problem where one’s understanding was incomplete leading to consistently wrong answers.
There are clearly people who are better and worse comparatively at solving certain problems. But given the complexity of our brains/biology, there are myriad reasons for these differences.
Clearly there are people who have the capacity to understand certain problems more deeply (e.g. Einstein), but all of this was primarily to say that output doesn’t need to be 100% “reliable” to imply a complete understanding.
Indeed, but I wasn't talking about mistakes at all, but specifically about the case when "A understands X better than B does", which is about their mental model of X. I hope you won't dispute that 1) we do say things like that all the time about people, and 2) most of us understand what this means, and it's not just about making fewer mistakes.
Ah, thanks for the clarification; I think I misread you here:
What did you mean by "it cannot be reliable in a sense of always giving the correct result", and why would that make understanding something a strict binary?
I do agree that some people understand some topics more deeply than others. I believe this to be true if for no other reason than watching my own understanding of certain topics grow over time. But what threw me off is that someone with "lesser" understanding isn't necessarily less "reliable". The degree of understanding may constrain the possibility space of the person, but I think that's something other than "reliability".
For example, someone who writes software using high level scripting languages can have a good enough understanding of the local context to reason about and produce code reliably. But that person may not understand in the same way that someone who built the language understands. And this is fine, because we're all working with abstractions on top of abstractions on top of abstractions. This does restrict the possibility space, e.g. the systems programmer/language designer can elaborate on lower levels of the abstraction, and some people can understand down to the bare metal/circuit level, and some people can understand down to the movement of atoms and signaling, but this doesn't make the JavaScript programmer less "reliable". It just means that their understanding will only take them so far, which primarily matters if they want to do something outside of the JavaScript domain.
To me, "reliability" is about consistency and accuracy within a problem space. And the ability to formulate novel conclusions about phenomena that emerge from that problem space that are consistent with the model the person has formed.
If we took all of this to the absurdist conclusion, we'd have to accept that none of us really understand anything at all. The smaller we go, and the more granular our world models, we still know nothing of primordial existence or what anything is.
I wouldn't call someone reciting a French-English dictionary a French speaker that understands the language.
Other way around. If a kid understood the concept of multiplication, they could train themselves on the next logical steps like exponents.
The consequence of this in terms of AI would be that they build on a series of concepts and quickly dwarf us in intelligence.
How many people years existed between the time people discovered multiplication and then exponentiation? Did it happen in wall clock time in a single generation?
We don't know, because these concepts were discovered before writing. We do know that far larger jumps have been made by individual mathematicians who never made it past 33 years of age.
Neither would I. But if someone can translate a French sentence into English, preserving the meaning most of the time, I think it would be reasonable to say that this person understands French. And that is what we're talking about here - models do produce correct output much of the time, even when you give them input that was not present in their training data. The "stochastic parrot" argument is precisely the claim that this still does not prove understanding. The "Chinese room" argument takes it one notch further and claims that even if the translation is 100% correct 100% of the time, that still doesn't prove understanding.
No, you can have abstract representations of a concept which would fit understanding in a certain sense of the word. You can have “understanding” of a concept without an overarching self aware hypervisor. It’s like isolating a set of neurons in your brain that represent an idea.
I don't buy that definition of understanding. Yours is closer to learning. If you had an accident and lost the ability to learn new things, you would still be able to understand based on what you have learnt so far. And that's what these models are like. They don't retrain themselves because we haven't told them to.
While I agree with the hidden statement of the utility in online reinforcement learning here, it should be pointed out that for some snapshot of a system, which may have already been trained with a large amount of data - the structure of the learned space and the inferences within it seem like a reasonable definition for 'understanding'.
I don't know many people that would suggest that knowledge of one's family structure, such as their mom/dad/uncle, and how their history relates to them, has to be _completely reconstructed_ from first principles every time they interact with their environment.
Online reinforcement learning can have large merits without resorting to stating it's required for learning. Just as self-awareness can occur independently of consciousness, which can occur independently of intelligence, which is independent of empathy, and so on. They are all different, and having one component doesn't mean anything about the rest.
From the other side, however, if you really think about it, our understanding of everything must be stochastic as well. So perhaps this sort of thing gives rise to many complexities that we are not aware of. How would I know I am not a stochastic parrot of some sort? I am just neural nets trained on the current environment, on top of a base model that depends on DNA, shaped through evolution and natural selection of the fittest. The same goes for currently competing LLMs, where the best one will win out.
You're not wrong, but the "stochastic parrot" claim always comes with another, implicit one that we are not like that; that there's some fundamental difference, somehow, even if it is not externally observable. Chinese room etc.
In short, it's the good old religious debate about souls, just wrapped in techno-philosophical trappings.
I don't think that's the core of the objection at all. I've never seen it made by people pushing the idea that AGI is impossible, just that AI approaches like LLMs are a lot more limited than they appear - basically that most of the intelligence they exhibit is that found in the training data.
Well, the question then becomes, what is the upper limit of a baby human raised by monkeys without human contact?
Will that baby automatically become a complex tool user with a better comprehension of the world around it than the monkeys in its group?
How responsible is the 'training data' in its environment for human intelligence?
As someone teaching their five year old to read, I think people way underestimate the amount of training data the average human child gets. And perhaps, since we live in a first world country with universal education, and a very rich one at that, many people have not seen what happens when kids don't get that training data.
It's also not just the qualia of sensation, but also that of the will. We all 'feel' we have a will, that can do things. How can a computer possibly feel that? The 'will' in the LLM is forced by the selection function, which is a deterministic, human-coded algorithm, not an intrinsic property.
In my view, this sensation of qualia is so out-there and so inexplicable physically, that I would not be able to invalidate some 'out there' theories. If someone told me they posited a new quantum field with scalar values of 'will' that the brain sensed or modified via some quantum phenomena, I'd believe them, especially if there was an experiment. But even more out there explanations are possible. We have no idea, so all are impossible to validate / invalidate as far as I'm concerned.
But in which way isn't most of our intelligence from training data?
First an evolutionary algorithm, and then the constant input we receive from the world as the training data, with reward mechanisms rewiring our neural networks based on what our senses interpret as good?
You could leverage the exact same accusation against the other side - we know fundamentally how the math works on these things, yet somehow throw enough parameters at them and eventually there's some undefined emergent behavior that results in something more.
What that something more is is even less defined, with even fewer theories as to what it is, than there are around the woo and mysticism of human intelligence. And as LarsDu88 points out in a separate thread, there are alternative explanations for what we're seeing here besides "We've created some sort of weird internal 3D engine that the diffusion models use for generating stuff," which also meshes closely with the fact that generations routinely have multiple perspectives and other errors that wouldn't exist if they modeled the world the way some people are suggesting.
If there's something more going on here, we're going to need some real explanations instead of things that can be explained multiple other ways before I'm going to take it seriously, at least.
But the other side doesn't see it that way, specifically not the "something more" part. It's "just math" all the way down, in our brains as well. The "emergent phenomena" are not undefined in this sense - they're (obviously in LLMs) also math, it's just that we don't understand it yet due to the sheer complexity of the resulting system. But that's not at all unusual - humans build useful things that we don't fully understand all the time (just look at physics of various processes).
This implies that the model of the world those things have either has to be perfect, or else it doesn't exist, which is a premise with no clear logic behind it. The obvious explanation is that, between the limited amount of information that can be extracted from the 2D photos that the NN is trained on, and the limit on the complexity of world modeling that an NN of a particular size can fit, its model of the world is just not particularly accurate.
If we used this threshold for physics, we'd have to throw out a lot of it, too, since you can always come up with a more complicated alternative explanation; e.g. aether can be viable if you ascribe just enough special properties to it. Pragmatically, at some point, you have to pick the most likely (usually this means the simplest) explanation to move forward.
The fundamental difference is qualia, which is physically inexplicable, and which LLMs and neural networks show no sign of having. It's not even clear how we would know if they did. As far as I can tell, this is something that escapes all current models of the physical universe, despite what many want to believe.
Well, when the cherry picked videos from OpenAI have a cat springing a fifth leg for no reason, people are right to be skeptical.
I have several thoughts on the concept of 'hallucination'. Firstly, most people do it regularly. I'm not sure why this alone is indicative of not understanding. Secondly, if we think about our dreams (the closest we can get, in my view, to making the human brain produce images without physical reality interfering), then actually we make very similar hallucinations. When you think about your dreams and think about the details, sometimes there are things that just kind of happen and make sense at the time, but when you look further, you're kind of like 'huh, that was strange'.
The images we get from these neural networks are trained on looking pleasing, for some definition of pleasing. That's why they look good on the whole, but get into that uncanny valley the moment you go inspecting. Similar to dreams.
Whereas obviously real human perception is (usually) grounded in reality.
As an old 3D graphics engineer, the fact that albedo is in there is just as striking as it is to be expected.
The core components of physically based rendering are position (derivable from image XY and depth), surface normal, incoming light, and at least albedo + one of a few variations on surface material properties such as specularity and roughness.
That the AI is modeling depth is pretty expected. Modeling surface normal is a nice local convolution of depth. But, modeling albedo separated from incoming light is great. I wonder if specularity is hiding in there too.
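To make the "local convolution of depth" point concrete, here is a rough NumPy sketch, not anything taken from the paper. It assumes the depth map is treated as a height field z = depth(x, y) with z pointing toward the viewer, and it uses plain Lambertian shading as a stand-in for the full rendering equation. The function names `normals_from_depth` and `lambert_shade` are made up for illustration.

```python
import numpy as np

def normals_from_depth(depth):
    """Estimate per-pixel unit normals from a depth map (H, W).

    np.gradient uses central differences in the interior, which is the
    same as convolving with a small [-0.5, 0, 0.5] kernel along each axis.
    """
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    # For a height field z = f(x, y), an unnormalized normal is (-df/dx, -df/dy, 1).
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(dz_dx)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def lambert_shade(albedo, normals, light_dir):
    """Shade = albedo * max(0, n . l): albedo (H, W, 3), normals (H, W, 3)."""
    l = np.asarray(light_dir, dtype=np.float64)
    l = l / np.linalg.norm(l)
    n_dot_l = np.clip(normals @ l, 0.0, None)  # (H, W) incoming-light term
    return albedo * n_dot_l[..., None]
```

The second function shows why separating albedo from lighting is the harder part: the observed pixel is a product of the two, so many (albedo, light) pairs explain the same image, whereas normals follow from depth almost mechanically.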
I'm mostly ok with the idea that the text-to-image models do this. What blows me away is that the autoregressive model does it too. We're not even asking it to produce a specific object which has a meaning or context assigned. The network learned to produce those representations as just "this is a useful decomposition if I want to recreate the same thing again". The jump from image compression to figuring out how to represent objects in space is insane. And makes me want to feed some optical illusions through that network - what's the depth map of the infinite staircase?
It's a good depth map, too. Better than other tools I've seen which require fiddling with lots of knobs to get a good result. Might be useful for textures for parallax mapping.