They love saying things like "generative AI doesn't know physics". But the constraint that both eyes should have consistent reflection patterns is just another statistical regularity that appears in real photographs. Better training, larger models, and larger datasets will lead to models that capture this statistical regularity. So this "one weird trick" will disappear without any special measures.
It seems that even discussion about AI is getting really polarized like everything else these days.
Comments are always one of these two types:
1 -> AI is awesome and perfect, if it isn't, another AI will make it perfect
2 -> AI is just garbage and will always be garbage
I have seen those comments, but I do wonder to what extent that is because the comments' authors intended such positions, versus subtlety and nuance being hard to write and easy to overlook when reading. (Ironically, humans are more boolean than LLMs; the word "nuance" itself seems a bit like ChatGPT's voice.)
I'm sure people place me closer to #1 than I actually feel, simply because I'm more often responding to people who seem to be too far in the #2 direction than vice versa.
Your comment seems pretty accurate because, from my perspective, I've never seen comments of type #1. And so, despite me explicitly saying otherwise, people like the GP commenter may be reading my comments as #1.
Even within this thread, https://news.ycombinator.com/item?id=41005386, https://news.ycombinator.com/item?id=41005633, https://news.ycombinator.com/item?id=41010124, and to a lesser extent https://news.ycombinator.com/item?id=41005240 seem like #1 to my eyes, with the sentiment of "It is detectable, therefore it will be easily corrected by near-future AI." Do you read these differently?
Of these four:
The first ('''So this "one weird trick" will disappear without any special measures''' etc.) does not seem so to me; I do not read that as a claim of perfection, merely a projection of the trends already seen.
The second ('''If the computer can see it we have a discriminator that we can use in a GAN-like fashion to train the network not to make that mistake again.'''): I agree with you; that's overstating what GANs can do. They're good, but they're not that good.
The third ('''Once you highlight any inconsistency in AI-generated content, IMHO, it will take a nothingth of a second to "fix" that.''') I'd lean towards agreeing with you, that seems to understate the challenges involved.
The fourth ('''Well, nice find, but now all the fakes have to do is add a new layer of AI that knows how to fix the eyes.''') is technically correct, but contrary to the meme this is not the best kind of correct; like the previous one, it downplays the challenge (though it is unclear to me whether that comes from nuance being hard to write and read, or from the genuine position). Also, once you're primed to look for people who underestimate the difficulties, I can easily see why you would read it as such an example, as it's close enough to be ambiguous.
Nobody has given me a good reason to use it or proof that what it does is more than recombining what it hoovers up, so... I'm in the second camp.
You could just... try it. It's very impressive what it can do. It's not some catch-all solution to everything but it saves me hours of time every week. Some of the things it can do are really quite amazing; my real-life example:
I took a picture of my son's grade 9 math homework worksheet and asked ChatGPT to tell me which questions he got wrong. It did that perfectly.
But I use it for the more mundane stuff like "From this long class definition, can you create a list of assignments for each property that look like this: object1.propertyName = object2.propertyName" and poof.
1 -> AI is awesome and perfect, if it isn't, another AI will make it perfect
2 -> AI is just garbage and will always be garbage
3 -> An awesome AI will actually predictably be a deep negative for nearly all people (for much more mundane reasons than the Terminator-genocide-cliche), so the progress is to be dreaded and the garbage-ness hoped for.
Your 1 is warmed over techno-optimism, which is far past its sell-by date but foundational to the tech entrepreneurship space. Your 2 greatly underestimates what tech people can deliver.
3 -> AI is still a technical concept, and does not yet exist.
Your comment is polarized.
Plenty of people think AI is useful (and equally as dangerous). Only useful, not redefines-everything. “I use AI as an assistant” is a common sentiment.
I’m in the AI is very useful but horribly named camp. It is all A and no I.
I think it's because at this point there is nothing else interesting to say. We've all seen AI-generated images that look impressively real. We've also all seen artifacts proving they aren't perfect. None of this is really new at this point.
1 also says “anything bad that AI does was already bad before AI and you just didn't care, scale is irrelevant”.
I can't read TFA because it's probably HNed. However an artist friend of mine said generated images are easy to spot because every pixel is "perfect". Not only eyes.
That explained pretty well why I thought even non-realistic ones felt ... uncanny.
However an artist friend of mine said generated images are easy to spot because every pixel is "perfect".
What does perfect mean? Do the pixels drawing 15 fingers count as "perfect"?
I think this heuristic is liable to fail in both directions. You will find images made by humans where "every pixel is perfect" (whatever that means) and you will also find AI which does mimic whatever imperfection you are looking for.
What does perfect mean?
Nothing out of focus or less detailed than other parts of the image...
Even the most cursory browsing of civitai or genai image generation subreddits shows this to not be true. Focus, bokeh, etc. are all things that can be generated by these models.
Your artist friend has deluded themselves with wishful thinking.
However an artist friend of mine said generated images are easy to spot because every pixel is "perfect".
It depends on the art. There was a discussion here a while ago that mentioned the use of Gen AI to create images of knitted / crocheted dolls. The images looked okay at a quick glance but were clearly fake because some of the colour changes weren't aligned with the position of the stitches. E.g. strands of supposed hair overlaid on the underlying stitched texture.
I'm sure there are lots of similar examples in other forms of art where the appearance is physically impossible.
AFAIK deepfakes can't mimic strong gesticulations very well, nor correctly mimic a head facing sideways.
Or was that corrected?
You think we can't generate a picture of a head facing sideways? Obviously incorrect.
The argument being that there is not very much video footage of people turning their heads, therefore not enough data to train deep fake videos / filters.
Videos are still very much in the baby phase. There are way, way easier tells when a video has been faked. We're talking about images. Turned head images are very much in scope.
While I'm no expert on the state of the art, you should keep in mind there is a huge difference between deepfakes created 100% from scratch and those created via face swap and style transfer.
Basically, it's easier to create believable gesticulations when there is footage of the actual person as raw material.
deepfakes can't
There is a big difference between can't and don't. Every next generation will do more than what the previous generation did.
Well, nice find, but now all the fakes have to do is add a new layer of AI that knows how to fix the eyes.
The commercial use case for generative art is not to make images that experts cannot discern as fake. It would be very expensive to annotate training images for physically correct reflections, and the value would be negligible. Realistically, if you want to produce something that is impossible to prove fake, you would have a vastly easier time making such edits manually. We are very, very, very far from being able to churn out undetectable fakes at the push of a button. Even making generically good outputs for art is a careful process with lots of iteration.
Or just use a conventional algorithm, since the fix is about formal physics. Although it would not be a true 100% fix, it could be good enough to make this test rather useless, because even now:
"There are false positives and false negatives..."
Indeed, but useful nonetheless. Solving it may be a challenge for a while, and deep fakes generated before a solution becomes broadly available will remain detectable with this technique.
Warning: photoshopped portraits (and most pro portraits ARE photoshopped, even slightly) may add "catch lights" in the eyes to make the portrait more "alive".
So that kind of clue only shows that the picture has been processed, not that the person in the picture doesn't exist or is a deepfake.
And the non-professional pictures, like the everyday smartphone pictures everyone takes, pass through so many layers of computational photography that the result is sometimes pretty far from reality.
When I shot events years ago, I always used a flash for fill, even outdoors. People like the glint in the eyes that it added.
Before the Photoshop times you could suss out lighting setups based on the reflections.
In an era when the creation of artificial intelligence (AI) images is at the fingertips of the masses, the ability to detect fake pictures – particularly deepfakes of people – is becoming increasingly important.
The masses having access to things wasn’t a cutoff point for me.
I would actually argue that once the masses are aware about certain technology existing and being in widespread use it becomes much easier to convince someone that a particular instance that the data is not trustworthy, so the ability to detect it through technological means becomes less important.
In the stage before widespread use people are much more easily tricked because they are unaware that others have certain capabilities which they never experienced first hand.
You're missing the flip side: falsely believing something is forged.
Now that the technology is so accessible and widespread, someone could deny truth by saying whatever audio/visual evidence was deepfaked, and people will believe it.
The Gini coefficient is normally used to measure how the light in an image of a galaxy is distributed among its pixels. This measurement is made by ordering the pixels that make up the image of a galaxy in ascending order by flux and then comparing the result to what would be expected from a perfectly even flux distribution.
Interesting, I’d only heard of the Gini coefficient as an econometric measure of income inequality.
Some decision tree algorithms use it to decide what variable to split on when creating new branches.
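For the curious, a minimal sketch of that use (Gini impurity as a CART-style split criterion) in Python; the tiny example labels are made up for illustration:

    from collections import Counter

    def gini_impurity(labels):
        # Gini impurity: 1 - sum of squared class probabilities.
        # A tree picks the split that minimises the weighted impurity
        # of the resulting branches.
        counts = Counter(labels)
        total = len(labels)
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    # gini_impurity(["cat", "cat", "dog"]) -> ~0.444; a pure branch gives 0.0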
Also found it interesting, but for its technical merits, as I recently had to glue some code together to analyze/compare droplet sizes from still frames of a high-speed video of a pressurized nozzle spraying a flammable fluid. (into a fire! neat! fire! FIRE!)
This approach might have been useful to try. I ended up finding a way to use ImageJ, an open source tool published by the NIH that biologists use to automatically count bacterial colony-forming units growing on petri dishes, but it was very slow and hacky. It was not perfect, but it gave an objective way to quantify information from a large body of existing test data with zero budget. https://en.wikipedia.org/wiki/ImageJ
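For comparison, a rough sketch of the same blob-counting idea in Python (assuming NumPy and SciPy; the threshold and size cutoff are placeholder values, not what ImageJ does internally):

    import numpy as np
    from scipy import ndimage

    def droplet_sizes(frame, threshold=0.5, min_pixels=5):
        # frame: 2D grayscale array with values in [0, 1]
        mask = frame > threshold                  # foreground = bright droplets
        labels, n_blobs = ndimage.label(mask)     # connected-component labelling
        sizes = ndimage.sum(mask, labels, index=range(1, n_blobs + 1))
        sizes = np.asarray(sizes)
        return sizes[sizes >= min_pixels]         # drop single-pixel noise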
Isn't it easier to simply look for all the 6 fingered hands?
Won't work on a deepfake of Count Rugen, for instance.
They've already mostly fixed the extra fingers (and weird hands in general) issue.
I don't understand the "galaxy" terminology in the sentence: "To measure the shapes of galaxies, we analyse whether they're centrally compact, whether they're symmetric, and how smooth they are"
Can someone explain?
Given that this is from the Royal Astronomical Society, I think they're literally talking about galaxies. They're then using these same scoring functions to characterize the reflections on the subjects' eyes, and comparing the result for the two eyes -- real photos should have similar values, generated images have more variation between the two eyes.
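A minimal sketch of that idea in Python (assuming NumPy; the eye crops, the 0.1 threshold, and the use of Gini alone are simplifications for illustration, not the paper's exact procedure):

    import numpy as np

    def gini(pixels):
        # Order pixel fluxes ascending and compare the cumulative distribution
        # to a perfectly uniform one: 0 = even flux, 1 = all flux in one pixel.
        v = np.sort(np.asarray(pixels, dtype=float).ravel())
        n = v.size
        cum = np.cumsum(v)
        if cum[-1] == 0:
            return 0.0
        return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

    def eyes_consistent(left_eye_crop, right_eye_crop, tol=0.1):
        # Real photos tend to give similar scores for both eyes;
        # a large gap is a hint the image may be generated.
        return abs(gini(left_eye_crop) - gini(right_eye_crop)) <= tol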
I wouldn't be shocked if phone cameras accidentally produced weird effects like this. Case in point: https://www.theverge.com/2023/12/2/23985299/iphone-bridal-ph...
Or https://www.theverge.com/2023/3/13/23637401/samsung-fake-moo...
And there is also software that fixes your eyes for selfies and video calls.
Ok. But it does feel like we’re scraping the bottom of the barrel.
Enhance.
The film Blade Runner was in large part about hunting down androids that were so close to being human.
Not part of the test, but a nifty part of the film, was using a photograph to find clues by looking deeply into reflections.
As has been said, this omission can be added as a test in generating the AI images in time, but I just loved how this inadvertently reminded me of Blade Runner.
Also be on the lookout for high flyin' clouds and people dancin' on a string.
I really wonder where the limit is for AI. Reality has an incredible amount of detail that you can't just simulate or emulate entirely. However, our perception is limited, and we can't process all those details. AI only has to be good enough to fool our perception, and I'm confident that every human-understandable method for identifying fakes can be fooled by generative AI. It will probably be up to AI to identify AI-generated content. Even then, noise and limited resolution will mask the flaws. For many forms of content, there will simply be no way to determine what's real.
Well, they are relatively easy to spot with the current AI software used to generate them especially if you are dealing on a daily basis with presentation attacks aka deepfakes for facial recognition. FACEIO has already deployed a very powerful model to deter such attacks for the purpose of facial authentication: https://faceio.net/security-best-practice#faceSpoof
The sample images don't show a large difference between the real and generated photo. The light sources in the real photo must have been pretty close to the subject.
Any algorithm that claims the ability to detect AI automatically must always be possible to circumvent. All one has to do is incorporate the algorithm in your image generation process, and perturb or otherwise modify your output until the image passes the test.
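A crude sketch of that circumvention loop in Python (detector is a hypothetical callable returning True when the image is flagged; a real attack would use gradients rather than random noise):

    import numpy as np

    def evade(image, detector, max_iters=1000, noise_scale=0.01, seed=0):
        rng = np.random.default_rng(seed)
        candidate = image.copy()
        for _ in range(max_iters):
            if not detector(candidate):
                return candidate                                # passes the test
            bump = rng.normal(0.0, noise_scale, size=image.shape)
            candidate = np.clip(candidate + bump, 0.0, 1.0)     # small perturbation
        return None                                             # gave up within budget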
Once you highlight any inconsistency in AI-generated content, IMHO, it will take a nothingth of a second to "fix" that.
Random thought: GCHQ and the IDF specifically seek out dyslexic employees to put on spotting "things out of place", be it an issue in a large amount of data, something that seems wrong on a map, or a picture that contains something physically impossible. Something about dyslexic processing provides an advantage here (not sure if I'd trade it for reading at one word per hour). Given GPTs are just NNs, I wonder if there is any "dyslexia-specific" neurology you could build a NN around and apply to problems neurodivergent minds are good at? Not sure what I'm really saying here, as I only have armchair knowledge.
Am I missing something here, or are the authors incorrectly using the term "deepfake" where "AI-generated" would have been more appropriate?
There's a lot of comments here discussing how generative AI will deal with this, which is really interesting.
But if somebody's actual goal was to pass off a doctored/AI-generated image as authentic, it would be very easy to just correct the eye reflection (and other flaws) manually, no?
Out of interest, how many CAPTCHAs are or were part of training? Is there any factual basis to the belief that that's what it descended to?
I don't know, the example photos of deepfakes here seem... pretty good. If that's the worst they could find, then this doesn't seem useful at all.
Even in the real photos, you can see that the reflections are different in both position and shape, because the two eyeballs aren't perfectly aligned and reflections are going to be genuinely different.
And then when you look at the actual "reflections" their software is supposedly detecting (highlighted in green and blue) and you compare with the actual photo, their software is doing a terrible job detecting reflections in the first place -- missing some, and spuriously adding others that don't exist.
Maybe this is a valuable tool for spotting deepfakes, but this webpage is doing a terrible job at convincing me of that.
(Not to mention that reflections like these are often added in Photoshop for professional photography, which might have similar subtle positioning errors, and training on those photos reproduces them. So then this wouldn't tell you at all that it's an AI photo -- it might just be a real photo that someone photoshopped reflections into.)
Ah! This is a great technique! Surely now that it's published it would be easily remediable in a compositing program like Nuke, but for more casual efforts, it's a solid test.
I suspect that detecting AI-generated content will become an arms race, just like spam filtering and SEO. Businesses will be built on using secret ML models to detect smaller and smaller irregularities in images and text. It'll be interesting to see who wins.
spy vs spy, round n
How does the deep fake have the same eye shape, same cropping and same skin blemishes as the real image? Did they inpaint eyes and call it deepfake for training?
If you can see the difference, then so can the computer. If the computer can see it, we have a discriminator that we can use in a GAN-like fashion to train the network not to make that mistake again.
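A toy sketch of that GAN-style update (assuming PyTorch; the tiny Linear networks are placeholders for a real generator and a flaw-spotting discriminator):

    import torch
    import torch.nn as nn

    generator = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 3 * 32 * 32))
    discriminator = nn.Sequential(nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 1))
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    z = torch.randn(8, 64)                  # latent noise batch
    fake = generator(z)                     # generated (flattened) images
    # generator step: push fakes toward whatever the discriminator calls "real",
    # i.e. away from the mistake the discriminator has learned to spot
    g_loss = bce(discriminator(fake), torch.ones(8, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()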
I wonder how true this is for face swap. Since actual scammers likely wouldn't generate deepfakes completely from scratch or static image.
The necklace in the right photo is a more obvious giveaway, whereas you have to look closely at eyes to see if they match.
Interesting, some portrait photographers use cross polarised light to eliminate reflection from glasses, but it has the side effect of eliminating reflection from eyes.
Did they try using this method on something that is not StyleGAN?
I took a film lighting class a long, long time ago at a community college. Even then, you could look at a closeup and tell where the lights were by the reflections in the eyes.
That still won’t make them understand physics.
This all reminds me of “fixing” mis-architected software by adding extra conditional code for every special case that is discovered to work incorrectly, instead of fixing the architecture (because no one understands it).
Maybe it will. It really depends whether it's "easier" for the network to learn an intuitive physics, versus a laundry list of superficial hacks that let it minimise loss all the same. If the list of hacks grows so long that gradient descent finds it easier to learn the actual physics, then it'll learn the physics.
Hinton argues that the easiest way to minimise loss in next token prediction is to actually understand meaning. An analogous thing may hold true in vision modelling wrt physics.
If your entire existence was constrained to seeing 2d images, not of your choosing, could a perplexity-optimizing process "learn the physics"?
Basic things that are not accessible to such a learning process:
- moving around to get a better view of a 3d object
- see actual motion
- measure the mass of an object participating in an interaction
- set up an experiment and measure its outcomes
- choose to look at a particular sample at a closer resolution (e.g. microscopy)
- see what's out of frame from a given image
I think we have at this point a lot of evidence that optimizing models to understand distributions of images is not the same thing as understanding the things in those images. In 2013 that was 'DeepDream' dog worms, in 2018 that was "this person does not exist" portraits where people's garments or hair or jewelry fused together or merged with their background. In 2022 it was diffusion images of people with too many fingers, or whose hands melted together if you asked for people shaking hands. In the Sora announcement earlier this year it was a woman's jacket morphing while the shot zoomed into her face.
I think in the same way that LLMs do better at some reasoning tasks by generating a program to produce the answer, I suspect models which are trained to generate 3D geometry and scenes, and run a simulation -> renderer -> style transfer process may end up being the better way to get to image models that "know" about physics.
They're being trained on video, 3d patches are being fed into the ViT (3rd dimension is time) instead of just 2d patches. So they should learn about motion. But they can't interact with the world so maybe can't have an intuitive understanding of weight yet. Until embodiment at least.
I mean, the original article doesn't say anything about video models (where, frankly, spotting fakes is currently much easier), so I think you're shifting what "they" are.
But still:
- input doesn't distinguish what's real vs constructed nonphysical motion (e.g. animations, moving title cards, etc)
- input doesn't distinguish what's motion of the camera versus motion of portrayed objects
- input doesn't distinguish what changes are unnatural filmic techniques (e.g. change of shot, fade-in/out) vs what are in footage
Some years ago, I saw a series of results about GANs for image completion, and they had an accidental property of trying to add points of interest. If you showed it the left half of a photo of just the ocean, horizon and sky, and asked for the right half, it would try to put a boat, or an island, because generally people don't take and publish images of just the empty ocean -- though most chunks of the horizon probably are quite empty. The distribution on images is not like reality.
Indeed. It will be very interesting when we start letting models choose their own training data. Humans and other animals do this simply by interacting with the world around them. If you want to know what is on the back of something, you simply turn it over.
My guess is that the models will come up with much more interesting and fruitful training sets than what a bunch of researchers can come up with.
The latter is always easier. Not to mention that the architectures are fundamentally curve fitters. There are many curves that can fit data, but not all curves are causally related to the data. The history of physics itself is a history of becoming less wrong, and many of the early attempts at problems (which you probably never learned about, fwiw) were pretty hacky approximations.
Hinton is only partially correct. It entirely depends on the conditions of your optimization. If you're trying to generalize and understand causality, then yes, this is without a doubt true. But models don't train like this and most research is not pursuing these (still unknown) directions. So if we aren't conditioning our model on those aspects, then consider how many parameters they have (and aspects like superposition). Without a doubt the "superficial hacks" are a lot easier and will very likely lead to better predictions on the training data (and likely test data).
The grokking papers show that after sufficient training, models can transition into a regime where both training and test error get arbitrarily small.
Yes, this is out of reach of how we train most models today. But it demonstrates how even current models are capable of building circuits that perfectly predict (meaning understand the actual dynamics of) the data, given sufficient exposure.
It really isn't easier at a sufficient complexity threshold.
Truth and reality cluster.
So hyperdimensional data compression which is organized around truthful modeling versus a collection of approximations will, as complexity and dimensionality approach uncapped limits, be increasingly more efficient.
We've already seen toy models do world modeling far beyond what was being expected at the time.
This is a trend likely to continue as people underestimate modeling advantages.
Human innate understanding of physics is a laundry list of superficial hacks. People need education and mental effort to go beyond that innate but limited understanding.
When it is said that humans innately understand physics, no one means that people innately understand the equations and can solve physics problems. I think we all know how laughable such a claim would be, given how much people struggle when learning physics and how few people even get to a moderate level (not even Goldstein, but at least calculus-based physics with partial derivatives).
What people mean when saying people innately understand physics is that they have a working knowledge of many of the implications. Things like that gravity is uniformly applied from a single direction and that is the direction towards ground. That objects move in arcs or "ballistic trajectories", that straight lines are uncommon, that wires hang with hyperbolic function shapes even if they don't know that word, that snow is created from cold, that the sun creates heat, many lighting effects (which is how we also form many illusions), and so on.
Essentially, humans know that things do not fall up. One could argue that this is based on a "laundry list of superficial hacks" and they wouldn't be wrong, but they also wouldn't be right. Even when wrong, the human formulations are (more often than not) causally formulated. That is, explainable _and_ rational (rational does not mean correct, but that it follows some logic. The logic doesn't need to be right. In fact, no logic is, just some are less wrong than others).
I guess it really depends on what the meaning of gradient descent learning the physics is.
Maybe you define it to mean that the actually correct equations appear encoded in the computation of the net. But this would still be tacit knowledge. It would be kind of like a math software being aware of physics at best.
I would assume that larger models working with additional training data will eventually allow them to understand physics to the same extent as humans inspecting the world - i.e. to capture what we call Naive Physics [0]. But the limit isn't there; the next generation of GenAI could model the whole scene and then render it with ray tracing (no special casing needed).
[0] https://en.wikipedia.org/wiki/Na%C3%AFve_physics
That’s not large models “understanding physics.” Rather, it’s giving output “statistically consistent” with real physical measurements. And no one, to my knowledge, has yet succeeded in building a general AI app that reverts to a deterministic calculation in response to a prompt.
ChatGPT has had the ability to generate and call out to deterministic Python scripts for a year now.
They will all do this with a fixed seed. They just don't do that because nobody wants it.
There seems to be little basis for this assumption, as current models don’t exhibit understanding. Understanding would allow applying it to situations that don’t match existing patterns in the training data.
Isn't that just what neural networks do? The way light falls on an object is physically deterministic, but the neural network in the brain of a human painter doesn't actually calculate rays to determine where highlights should be. A center fielder knows where to run to catch a fly ball without having to understand the physics acting on it. Similarly, we can spot things that look wrong, not because we're referring to physical math but because we have endless kludged-together rules that supersede other rules. Like: heavy objects don't float. Except for boats, which do float. Except for boats that are leaking, which don't. To explain why something is happening we refer to specialized models, and these image generation models are too general for that, but there's no reason they couldn't refer to separate physical models to assist their output in the future.
Boats are mostly air by volume, which isn't heavy at all compared to water.
They don't have to. They just have to understand what makes a realistic picture. The author of the article isn't really employing physics either; he's comparing the eyes to each other.
This is more a comment to the word "understand" than "physics".
Yes, the models output will converge to being congruent with laws of physics by virtue of deriving that as a latent variable.
Most "humans" don't understand physics to a Platonic level and act in much the same way as a model, finding best fits among a set of parameters that produce a result that fits some correctness check.
Isn't that what AI training is in general? It has worked pretty well so far.
I don't think img-gen AI is ever going to "understand physics", but that isn't the task at hand. I don't think it is necessary to understand physics to make good fake pictures. For that matter, I don't think understanding physics would even be a good approach to the fake picture problem.
Hypothetically, with enough information, one could predict the future (barring truly random events like radioactive decay). Generative AI is also constrained by economic forces - how much are GenAI companies willing to invest to get eyeball reflections right? Would they earn adequate revenue to cover the increase in costs to justify that feature? There are plenty of things that humanity can technically achieve, that don't get done because the incentives are not aligned- for instance, there is enough food grown to feed every human on earth and the technology to transport it, and yet we have hunger, malnutrition and famines.
This isn't how it works. As the models are improved, they learn more about reality largely on their own. Except for glaringly obvious problems (like hands, deformed limbs, etc.), the improvements really just give the models techniques for more accurately replicating features from the training data. There's nobody that's like "today we're working on fingernails" or "today we're making hair physics work better": it's about making the model understand and replicate the features already present in the training dataset.
AI models aren't complete black boxes to the people who develop them: there is careful thought behind the architecture, dataset selection and model evaluation. Assuming that taking an existing model and simply throwing more compute at it will automatically result in higher-fidelity illumination modeling takes almost religious levels of faith. If moar hardware were all you need, Nvidia would have the best models in every category right now. Perhaps someone ought to write the sequel to Fred Brooks' book and name it "The Mythical GPU-Cluster-Month".
FWIW, Google has AI-based illumination adjustment in Google Photos where one can add virtual lights - so specialized models for lighting already exist. However, I'm very cynical about a generic mixed model incidentally gaining those capabilities without specific training for it. When dealing with exponential requirements (training data, training time, GPUs, model weight size), you'll run out of resources in short order.
Seems an odd response to a poster who said “as the models are improved...”; the way the models are improved isn't just additional training to existing models, its updated model architectures.
Nvidia is making boatloads of money right now selling GPUs to companies that think they will be making boatloads of money in the future.
Nvidia has the better end of things at this very moment in time.
What you're refuting isn't what I said. I'm making the point that nobody is encoding all of the individual features of the human form and reality into their models through code or model design. You build a model by making it capable of observing details and then letting it observe the details of your training data. Nobody is spending time getting the reflections in the eyeballs working well, that comes as an emergent property of a model that's able to identify and replicate that. That doesn't mean it's a black box, it means that it's built in a general way so the researchers don't need to care about every facet of reality.
I don't know where you get this opinion from as it doesn't match the landscape that I'm witnessing. Around the huge names are many companies and business units fine-tuning foundational models on their private datasets for the specific domains they are interested in. I can think of scenarios where someone is interested in training models to generate images with accurate reflections in specific settings.
There could be edge cases, but fine tuning doesn't normally concentrate on a single specific feature. With positive and negative examples you could definitely train the eyes, but it's not what people usually do. Fine tuning is widely used to provide a specific style, clothes, or other larger scale elements.
"Normally" is being strained here: yes, most fine-tuning isn't for things like this, but quite a substantial minority is for more accurate rendering of some narrow specific feature, especially ones typically identified as signature problems of AI image gen; publicly identifying this as a way to visually distinguish AI gens makes it more likely that fine-tuning effort will be directed at addressing it.
No, it’s a valid point, which I didn’t interpret as literally “we’re working on eyeballs today” but rather “we’re scaling up these imperfect methods to a trillion dollar GPU cluster”, the latter of which is genuinely something people talk about. The models will learn to mimic more and more of the long tail of the distribution of training data, which to us looks like an emergent understanding. So there’s a theoretical amount of data you could provide for them to memorize physical laws.
The issue is practical. There isn’t enough data out there to learn the long tail. If neural nets genuinely understood the world they would be getting 100% on ARC.
Willing to? Probably not much. Should? A WHOLE LOT. It is the whole enchilada.
While this might not seem like a big issue and truthfully most people don't notice, getting this right (consistently) requires getting a lot more right. It doesn't require the model knowing physics (because every training sample face will have realistic lighting). But what underlines this issue is the model understanding subtleties. No model to date accomplishes this. From image generators to language generators (LLMs). There is a pareto efficiency issue here too. Remember that it is magnitudes easier to get a model to be "80% correct" than to be "90% correct".
But recall that the devil is in the details. We live in a complex world, and what that means is that the subtleties matter. The world is (mathematically) chaotic, so small things have big effects. You should start solving problems not worrying about these, but eventually you need to move into tackling these problems. If you don't, you'll just generate enshitification. In fact, I'd argue that the difference between an amateur and an expert is knowledge of subtleties and nuance. This is both why amateurs can trick themselves into thinking they're more expert than they are and why experts can recognize when talking to other experts (I remember a thread a while ago where many people were shocked about how most industries don't give tests or whiteboard problems when interviewing candidates and how hiring managers can identify good hires from bad ones).
Yeah, every person is constantly predicting the future, often even scarily accurately. I don't see how this is a hot take at all.
I’m far from an expert on this, but these are often trained in conjunction with a model that recognizes deep fakes. Improving one will improve the other, and it’s an infinite recursion.
I could see state actors being willing to invest to be able to make better propaganda or counter intelligence.
Getting the eyeballs correct will correlate with other very useful improvements.
They won’t train a better model just for that reason. It will just happen along the way as they seek to broadly improve performance and usefulness.
Popper disagrees
https://en.wikipedia.org/wiki/The_Poverty_of_Historicism
"Individual human action or reaction can never be predicted with certainty, therefore neither can the future"
See the death of Archduke Franz Ferdinand - perhaps it could be predicted once it was known that he would go to Sarajevo. But before?
If you look at SciFi, some things have been predicted, but many -obvious things- haven't.
What if Trump had been killed?
And Kennedy?
Wouldn't the adversarial model training also have to take "physics correctness" into account? As long as the image detects as "<insert celebrity> in blue dress", why would it care about correct details in the eyes if nothing in the "checker" cares about that?
Current image generators don’t use an adversarial model. Though the ones that do would have eventually encoded that as well; the details to look for aren’t hard-coded.
Interesting. Apparently, I have much to learn.
GP told you how they don't work, but not how they do:
Current image generators work by training models to remove artificial noise added to the training set. Take an image, add some amount of noise, and feed it with its description as inputs to your model. The closer the output is to the original image, the higher the reward.
Using some tricks (a big one is training simultaneously on large and small amounts of noise), you ultimately get a model that can remove 99% noise based only on the description you feed it, and that means you can just swap out the description for what you want the model to generate and feed it pure noise, and it'll do a good job.
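A toy version of that training step (assuming PyTorch; the stand-in MLP, the linear noise mix, and the omission of text conditioning are all simplifications of what real diffusion models do):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(3 * 32 * 32 + 1, 256), nn.ReLU(),
                          nn.Linear(256, 3 * 32 * 32))   # stand-in denoiser
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    images = torch.rand(8, 3 * 32 * 32)       # a batch of (flattened) training images
    t = torch.rand(8, 1)                      # random noise level per image
    noise = torch.randn_like(images)
    noisy = (1 - t) * images + t * noise      # "add some amount of noise"

    recon = model(torch.cat([noisy, t], dim=1))      # try to recover the original
    loss = nn.functional.mse_loss(recon, images)     # closer to original = lower loss
    opt.zero_grad()
    loss.backward()
    opt.step()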
I read this description of the algorithm a few times and I find it fascinating because it's so simple to follow. I have a lot of questions, though, like "why does it work?", "why nobody thought of this before", and "where is the extra magical step that moves this from 'silly idea' to 'wonder work'"?
The answer to the 2nd and 3rd questions is mostly "vastly more computing power available", especially the kind that CUDA introduced a few years back.
This was my first thought too... You can already see in the examples in the article that the model understands that photos of people tend to have reflections of lights in their eyes. It understands that both eyes tend to reflect the same number of lights. It's already modelling that there's a similarity relationship between these areas of the image (nobody has heterochromia in these pictures).
I can remember when it was hard for image generators to model the 3d shape of a table, now they can very easily display 4 very convincing legs.
I don't have technical expertise here, but it just seems like a natural starting point to assume that this reflection thing is a transient shortcoming.
Or the times when the model would create six fingers. No longer.
Sometimes this is addressed not by fixing one model, but instead by running post-processing models that are specialized to fix particular known defects, like oddities with fingers.
Ah!
But we don’t know how much larger the models will have to be, how large the datasets, or how much training is needed, do we? They could have to be inconceivably large.
If you want to correct for this particular problem you might be better off training a face detector, an eye detector, and a model that takes two eyes as input and corrects for this problem. The process (sketched in code below) would then be:
- generate image
- detect faces
- detect eyes in each face
- correct reflections in eyes
That is convoluted, though, and would get even more convoluted when you want to correct for multiple such issues. It also might be problematic in handling faces with glass eyes, but you could try to ‘detect’ those with a model that is trained on the prompt.
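A minimal sketch of that pipeline (face_detector, eye_detector and fix_reflections are hypothetical callables named only for illustration; this isn't any particular library's API):

    def correct_eye_reflections(image, face_detector, eye_detector, fix_reflections):
        for face_box in face_detector(image):               # 1. detect faces
            eyes = eye_detector(image, face_box)             # 2. detect eyes in each face
            if len(eyes) == 2:                               # skip profiles / occlusions / glass eyes
                left, right = eyes
                image = fix_reflections(image, left, right)  # 3. make reflections consistent
        return image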
ADetailer does exactly that. Feels like the large thread above is mostly from non-practitioners.
There’s no eyes module in it by default, but it’s trivial-ish to add, and a hires eyes dataset isn’t hard to collect either.
Just found an eyes model at https://civitai.com/models/150925/eyes-detection-adetailer (seems anime-only).
I feel like a GAN method might work better, building a detector, and training the model to defeat the detector.
The opposite might also be true. Just having better, well curated data goes a long way. LAION worked for a long time because it's huge, but what if all the garbage images were filtered out and the annotations were better?
The early generations of image and video models used middling data because it was the only data. Since then, literally everyone with data has been working their butts off to get it cleaned up to make the next generation better.
Better data, more intricate models, and improvements to the underlying infrastructure could mean these sorts of "improvements" come mostly "for free".
Just like the 20 fingers disappeared
Those early diffusion generators sure managed to make the flesh monster in The Witcher look sane sometimes.
/s, right? I haven’t actually seen any models really make this disappear yet.
Shouldn't a GAN be able to use this fact immediately in its adversarial network?
Unfortunately, no. The discriminator always needs to be in balance and contention with the generator. You can swap out the discriminator later, but you also have to make sure your discriminator is able to identify these errors, and ML models aren't the best at noticing small details. And since they too don't understand physics, there is no reason to believe they will encode such information, despite every image in real life requiring consistency. Also remember that there is a learning trajectory, and these small details are most certainly not learned early on in networks. The problem is that these errors are trivial to identify post hoc, but not a priori. It is also easy for you because you know physics innately and can formulate causal explanations.
Did anybody prompt a GenAI to get this output?
It wouldn't work. The models could put stuff in the eyes, but they wouldn't be able to do so realistically, consistently, or even a fraction of the time. The text describing the images does not typically annotate tiny details like correct reflections in the eyes, so prompting for it is useless.
Here's the link about neural scaling law: https://en.wikipedia.org/wiki/Neural_scaling_law
Can you make a napkin calculation of how much better the training should be, and how much larger the models and datasets should be, to overcome differences in widely separated but related pixels of the image?
I did that and the results are not promising: a 200-billion-parameter model will not perform much better than a 100-billion-parameter one. The loss is already too small.
Also, the phenomenon exemplified in the article is a problem of relations between distant entities in generated media. The same can be seen in LLMs, where they have inconsistencies between the beginnings and ends of sentences.
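For a napkin version of that: assuming a Kaplan-style power law L(N) ~ N^(-alpha) with alpha ≈ 0.076 (a language-model figure; image models may have a different exponent, so treat this as illustrative only):

    alpha = 0.076
    loss_ratio = (200e9 / 100e9) ** (-alpha)   # loss at 200B params vs. 100B params
    print(f"Doubling parameters multiplies loss by ~{loss_ratio:.3f} "
          f"(about a {(1 - loss_ratio) * 100:.0f}% reduction)")

That works out to roughly a 5% loss reduction per doubling under this assumption.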
If generated eyes start to lie, there will be different separated but related objects to rely on.
Hi, author here of a model that does really well on this[0]. My model is SOTA and has undergone a third-party user study that shows it generates convincing images of faces[1]. AND my undergrad is in physics. I'm not saying this to brag, I'm giving my credentials: I have deep knowledge in both generating realistic human faces and in physics. I've seen hundreds of thousands of generated faces from many different models and architectures.
I can assure you, these models don't know physics. What you're seeing is the result of attention. Go ahead and skip the front matter in my paper and go look at the appendix where I show attention maps and go through artifacts.
Yes, the work is on GANs, but the same principles apply to diffusion models; diffusion models are just typically MUCH bigger and have way more training data (sure, I had access to an A100 node at the time, but even one node makes you GPU poor these days, so best to explore on GANs ):
I'll point out flaws in images in my paper, but remember that these fool people and you're now primed to see errors, and if you continue reading you'll be even further informed. In Figures 8-10 you can see the "stars" that the article talks about. You'll see mine does a lot better. But the artifact exists in all images. You can also see these errors in all of the images in the header, but they are much harder to see. But I did embed the images as large as I could into the paper, so you can zoom in quite a bit.
Now there are ways to detect deep fakes pretty readily, but it does take an expert eye. These aren't the days of StyleGAN2, where monsters were common (well... at least on GANs, and diffusion is getting there). Each model and architecture has a different unique signature, but there are key things that you can look for if you want to get better at this. Here's what I look for, and I've used these to identify real-world fake profiles; you will see them across Twitter and elsewhere:
- Eyes: Eyes are complex in humans, with lots of texture. Look for "stars" (inconsistent lighting), pupil dilation, pupil shape, heterochromia (can be subtle; see Figure 2, last row, column 2 for example), and the texture of the iris. And also make sure to look at the edges of the eyes (Figs 8-10).
- Glasses: look for aberrations, inconsistent lighting/reflections, and pay very close attention to the edges where new textures can be created
- Necks: These are just never right. The skin wrinkles, shape, angles, etc
- Ears: These always lose detail (as seen in TFA and my paper), lose symmetry in shape, and are often not lit correctly; if there are earrings, then watch for the same things too (see TFA).
- Hair: Dear fucking god, it is always the hair. But I think most people might not notice this at first. If you're having trouble, start by looking at the strands. Start with Figure 8. Patches are weird, color changes, texture, direction, and more. Then try Fig 9 and TFA.
- Backgrounds: I make a joke that the best indicator to determine if you have a good quality image is how much it looks like a LinkedIn headshot. I have yet to see a generated photo that has things happening in the background that do not have errors. Both long-range and local. Look at my header image with care and look at the bottom image in row 2 (which is pretty good but has errors), row 2 column 4, and even row 1 in column 4's shadow doesn't make sense.
- Phase Artifacts: This one is discussed back in StyleGAN2 paper (Fig 6). These are still common today.
- Skin texture: Without fail, unrealistic textures are created on faces. These are hard to use in the wild though because you're typically seeing a compressed image and that creates artifacts too and you frequently need to zoom to see. They can be more apparent with post processing though.
There's more, but all of these are a result of models not knowing physics. If you are just scrolling through Twitter you won't notice many of these issues. But if you slow down and study an image, they become apparent. If you practice looking, you'll quickly learn to find the errors with little effort. I can be more specific about model differences but this comment is already too long. I can also go into detail about how we can't determine these errors from our metrics, but that's a whole other lengthy comment.
[0] https://arxiv.org/abs/2211.05770
[1] https://arxiv.org/abs/2306.04675
Agreed, but the tricks are still useful.
When there are no more tricks remaining, I think we must be pretty close to AGI.
But "better training" here is a special measure. It would take a lot of training effort to defeat this check. For example, you'd need a program or group of people who would be able to label training data as realistic/not based on the laws of physics as reflected in subjects' eyeballs.
I know there are murmurs that synthetic data (i.e. using rendering software with 3D models) was used to train some generative models, including OpenAI Sora; seems like it's the only plausible way right now to get the insane amounts of data needed to capture such statistical regularities.
A simpler process is to automatically post process the image to “fix” the eyes. Similar techniques are used to address deformities with hands and other localized issues.
Exactly. Notably, in my experiments, diffusion models based on U-Nets (e.g. SD1.4, SD2) are worse at capturing "correlations at a distance" like this in comparison to newer, DiT-based methods (e.g. SD3, PixArt).
That's not guaranteed; AI does find statistical regularities we miss, but it also misses some we find.
Treating knowing (or understanding) as binary is a common failing in discussions about AI.