So, this is surprising. Apparently there's more cause, effect, and sequencing in diffusion models than I expected, which would be roughly 'none'. Google uses SD 1.4 here as the core of the diffusion model, which is a nice reminder that open models are useful even to giant cloud monopolies.
The two main things of note I took away from the summary were: 1) they got infinite training data using agents playing Doom (makes sense), and 2) they added Gaussian noise to source frames and rewarded the agent for 'correcting' sequential frames back, and said this was critical to getting long-range stable 'rendering' out of the model.
That last point is intriguing - they explain the intuition as teaching the model to do error correction / guiding it toward stability.
Finally, I wonder if this model would be easy to fine-tune for 'photorealistic' / ray-traced restyling - I'd be super curious to see how hard it would be to get a 'nicer' rendering out of this model, treating it as a Doom foundation model of sorts.
Anyway, a fun idea that worked! Love those.
To temper this a bit, you may want to pay close attention to the demo videos. The player rarely backtracks, and for good reason - the few times the character does turn around and look back at something a second time, it has changed significantly (the most noticeable, I think, is the room with the grey wall and triangle sign).
This falls in line with how we'd expect a diffusion model to behave - it's trained on many billions of frames of gameplay, so it's very good at generating a plausible -next- frame of gameplay based on some previous frames. But it doesn't deeply understand logical gameplay constraints, like remembering level geometry.
Great observation. And not entirely unlike normal human visual perception, which is notoriously vulnerable to missing highly salient information; I'm reminded of the "gorillas in our midst" work by Dan Simons and Christopher Chabris [0].
[0]: https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...
Are you saying if I turn around, I'll be surprised at what I find? I don't feel like this is accurate at all.
If a generic human glances at an unfamiliar screen/wall/room, can they accurately, pixel-perfectly reconstruct every single element of it? Can they do it for every single screen they have seen in their entire lives?
I never said pixel-perfect, but I would be surprised if whole objects, like flaming lanterns, suddenly appeared.
What this demo demonstrates to me is how incredibly willing we are to accept what seems familiar to us as accurate.
I bet if you look closely and objectively you will see even more anomalies. But on first watch, I didn't see most of the errors, because I think accepting something is more efficient for the brain.
You'd likely be surprised by a flaming lantern unless you were in Flaming Lanterns 'R Us, but if you were watching a video of a card trick and the two participants changed clothes while the camera wasn't focused on them, you may well miss that and the other five changes that came with that.
Not exactly, but our representation of what's behind us is a lot more sparse than we would assume. That is, I might not be surprised by what I see when I turn around, but it could have changed pretty radically since I last looked, and I might not notice. In fact, an observer might be quite surprised that I missed the change.
Objectively, Simons and Chabris (and many others) have a lot of data to support these ideas. Subjectively, I can say that these types of tasks (inattentional blindness, change blindness, etc.) are humbling.
Well, it's a bit of a spoiler to encounter this video in this context, but this is a very good video: https://www.youtube.com/watch?v=LRFMuGBP15U
Even having a clue why I'm linking this, I virtually guarantee you won't catch everything.
And even if you do catch everything... the real thing to notice is that you had to look. Your brain does not flag these things naturally. Dreams are notorious for this sort of thing, but even in the waking world your model of the world is much less rich than you think. Magic tricks like to hide in this space, for instance.
Yup, great example! Simons's lab has done some things along exactly these lines [0], too.
[0]: https://www.youtube.com/watch?v=wBoMjORwA-4
The opposite - if you turn around and there's something that wasn't there the last time - you'll likely not notice if it's not out of place. You'll just assume it was there and you weren't paying attention.
We don't memorize things that the environment remembers for us if they aren't relevant for other reasons.
Work which exaggerates the blindness.
The people were told to focus very deeply on a certain aspect of the scene. Maintaining that focus means explicitly blocking out things not related to that focus. Also, there is social pressure at the end to have performed well at the task; evaluating people on a task which is intentionally completely different from the one explicitly given is going to bias them away from reporting gorillas.
And also, "notice anything unusual" is a pretty vague prompt. No-one in the video thought the gorillas were unusual, so if the PEOPLE IN THE SCENE thought gorillas were normal, why would I think they were strange? Look at any TV show, they are all full of things which are pretty crazy unusual in normal life, yet not unusual in terms of the plot.
Why would you think the gorillas were unusual?
I understand what you mean. I believe that the authors would contend that what you're describing is a typical attentional state for an awake/aware human: focused mostly on one thing, and with surprisingly little awareness of most other things (until/unless they are in turn attended).
Furthermore, even what we attend to isn't always represented with all that much detail. Simons has a whole series of cool demonstration experiments where they show that they can swap out someone you're speaking with (an unfamiliar conversational partner like a store clerk or someone asking for directions), and you may not even notice [0]. It's rather eerie.
[0]: https://www.youtube.com/watch?v=FWSxSQsspiQ&t=5s
Not noticing a gorilla that 'shouldn't' be there is not the same thing as object permanence. Even quite young babies are surprised by objects that go missing.
That's absolutely true. It's also well-established by Simons et al. and others that healthy normal adults maintain only a very sparse visual representation of their surroundings, anchored but not perfectly predicted by attention, and this drives the unattended gorilla phenomenon (along with many others). I don't work in this domain, but I would suggest that object permanence probably starts with attending and perceiving an object, whereas the inattentional or change blindness phenomena mostly (but not exclusively) occur when an object is not attended (or only briefly attended) or attention is divided by some competing task.
It reminds me of dreaming, when you do something, turn back to check, and it has turned into something completely different.
edit: someone should train it on MyHouse.wad
I saw a longer video of this that Ethan Mollick posted, and in that one the sequences are longer and do appear to demonstrate a fair amount of consistency. The clips don't backtrack in the summary video on the paper's home page because they're showing a number of distinct environments, but you only get a few seconds of each.
If I studied the longer one more closely, I'm sure inconsistencies would turn up, but it seemed able to recall the presence/absence of destroyed items, dead monsters, etc. on subsequent loops around a central obstruction that completely obscured them for quite a while. This did seem pretty odd to me, as I expected it to match how you'd described it.
Yes, it definitely is very good at simulating gameplay footage, don't get me wrong. Its input for predicting the next frame is not just the previous frame; it has access to a whole sequence of prior frames.
But to say the model is simulating actual gameplay (i.e. that a person could actually play Doom in this) is far-fetched. It's definitely great that the model was able to remember that the gray wall was still there after we turned around, but it's untenable for actual gameplay that the wall completely changed location and orientation.
It would in an SCP-themed game. Or a dreamscape/Inception-themed one.
Hell, "you're trapped in Doom-like dreamscape, escape before you lose your mind" is a very interesting pitch for a game. Basically take this Doom thing and make walking though a specific, unique-looking doorway from the original game to be the victory condition - the player's job would be to coerce the model to generate it, while also not dying in the Doom fever dream game itself. I'd play the hell out of this.
(Implementation-wise, just loop in a simple recognition model to continously evaluate victory condiiton from last few frames, and some OCR to detect when player's hit points indicator on the HUD drops to zero.)
(I'll happily pay $100 this year to the first project that gets this to work. I bet I'm not the only one. Doesn't have to be Doom specifically, just has to be interesting.)
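To make that concrete, here's a minimal sketch of the win/lose loop, under stated assumptions: `next_frame` (the diffusion "engine"), `is_victory_doorway` (a small recognition model), and the HUD crop box are all hypothetical placeholders; only pytesseract/PIL are real libraries.

```python
# Hypothetical win/lose loop for the "Doom dreamscape" pitch above.
# `next_frame` and `is_victory_doorway` are assumed stubs; the HUD
# crop coordinates are made up for illustration.
import pytesseract
from PIL import Image

HUD_HP_BOX = (48, 440, 104, 472)  # assumed pixel region of the HP counter

def read_hp(frame: Image.Image):
    """OCR the hit-point counter from the (assumed) HUD region."""
    text = pytesseract.image_to_string(
        frame.crop(HUD_HP_BOX),
        config="--psm 7 -c tessedit_char_whitelist=0123456789")
    return int(text) if text.strip().isdigit() else None  # None = unreadable

def game_loop(next_frame, is_victory_doorway):
    recent = []
    while True:
        frame = next_frame()              # one frame from the diffusion model
        recent = (recent + [frame])[-8:]  # keep only the last few frames
        if is_victory_doorway(recent):    # recognition model says "won"
            return "escaped"
        if read_hp(frame) == 0:           # HUD says the dream got you
            return "lost in the dream"
```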
To be honest, I agree! That would be an interesting gameplay concept for sure.
Mainly just wanted to temper expectations I'm seeing throughout this thread that the model is actually simulating Doom. I don't know what will be required to get from here to there, but we're definitely not there yet.
What you're pointing at mirrors the same kind of limitation in using LLMs for role-play/interactive fiction.
Maybe a hybrid approach would work - certain things like inventory stored as variables, lists, etc. (roughly sketched below).
Wouldn't be as pure, though.
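As a toy illustration of that hybrid split (entirely hypothetical - the paper does nothing like this): "hard" state lives in ordinary variables outside the model, and only the rendering is left to the diffusion side.

```python
# Toy sketch of the hybrid idea: inventory/HP/ammo tracked in plain
# variables, updated from recognized in-game events, and usable as extra
# conditioning. All names here are illustrative, not real APIs.
state = {"hp": 100, "armor": 0, "ammo": 50, "inventory": ["pistol"]}

def apply_event(state, event):
    """Update the authoritative game state from a recognized event."""
    if event == "pickup_shotgun":
        state["inventory"].append("shotgun")
        state["ammo"] += 8
    elif event == "hit":
        state["hp"] = max(0, state["hp"] - 10)
    return state  # this dict, not the model's memory, is the ground truth
```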
Or what about training the model on many FPS games? Surviving in one nightmare that morphs into another, into another, into another...
Check out the actual modern DOOM WAD MyHouse which implements these ideas. It totally breaks our preconceptions of what the DOOM engine is capable of.
https://en.wikipedia.org/wiki/MyHouse.wad
MyHouse is excellent, but it mostly breaks our perception of what the Doom engine is capable of by not really using the Doom engine. It leans heavily on engine features which were embellishments by the GZDoom project, and never existed in the original Doom codebase.
It's an empirical question, right? But they didn't do it...
But does it need to be frame-based?
What if you combined this with an engine running in parallel that provides all geometry, including characters and objects with their respective behavior, recording the changes made through the interactions the other model generates, and talking back to it?
A dialogue between two parties with different functionality so to speak.
(Non technical person here - just fantasizing)
In that scheme what is the NN providing that a classical renderer would not? DOOM ran great on an Intel 486, which is not a lot of computer.
An experience that isn’t asset- but rule-based.
It always blew my mind how well it worked on a 33 MHz 486. I'm fairly sure it ran at 30 fps at 320x200. That gives it just over 17 clock cycles per pixel, and that doesn't even include time for game logic.
My memory could be wrong, though - but even if it required a 66 MHz chip to reach 30 fps, that's still only 34 clocks per pixel, on an architecture that required multiple clocks for a simple integer add instruction.
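The arithmetic checks out (plain Python, using just the numbers above):

```python
# Cycles-per-pixel sanity check for the figures above.
pixels = 320 * 200            # 64,000 pixels per frame
fps = 30
print(33e6 / (fps * pixels))  # ~17.2 cycles/pixel on a 33 MHz 486
print(66e6 / (fps * pixels))  # ~34.4 cycles/pixel on a 66 MHz 486
```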
What would the model provide if not what we see on the screen?
The environment and everything in it.
“Everything” would mean all objects and the elements they’re made of, their rules on how they interact and decay.
A modularized ecosystem, I guess, composed of "sub-systems" of sorts.
The other model, which provides all interaction (cause for effect), could either be run artificially or be used interactively by a human - opening up the possibility of being a tree : )
This all would need an interfacing agent that in principle would be an engine simulating the second law of thermodynamics, while at the same time recording every state that has changed and diverged off the driving actor's vector in time.
Basically, the "effects" model keeping track of everyone's history.
In the end, a system with an "everything" model (that can grow over time), a "cause" model messing with it, brought together and documented by the "effect" model.
(Again … non technical person, just fantasizing) : )
In that case, the title of the article wouldn’t be true anymore. It seems like a better plan, though.
I don't see this as something that would be hard to overcome. Sora, for instance, has already shown the ability of a diffusion model to maintain object permanence. Flux, too, has recently shown the ability to render the same person across many different poses and images.
Where does a Sora video turn around backwards? I can't maintain such consistency in my own dreams.
I don't know of an example (not to say it doesn't exist) but the problem is fundamentally the same as things moving out of sight/out of frame and coming back again.
Maybe it is, but doing that with the entire scene instead of just a small part of it makes the problem massively harder, as the model needs to grow exponentially to remember more things. It isn't something we will manage anytime soon - maybe in 10-20 years with the current architecture and the same pace of compute progress.
Then you make that even harder by remembering a whole game level? No, that ain't gonna happen in our lifetimes without massive changes to the architecture. They would need a separate model to keep track of level state and the like, not just an image-to-image model.
Is that something that can be solved with more memory/attention/context?
Or do we believe it's an inherent limitation of the approach?
I think the real question is does the player get shot from behind?
great question
Tangentially related, but Grand Theft Auto speedrunners often point the camera behind them while driving so cars don't spawn "behind" them (aka in front of the car).
That is kind of cool, though - I would play it like being lost in a dream.
If, on the backend, you could record the level layouts in memory, you could have exploration teams that try to find new areas to explore.
It would be cool for dream sequences in games to feel more like dreams. This is probably an expensive way to do it, but it would be neat!
Even purely going forward, specks on wall textures morph into opponents and so on. All the diffusion-generated videos I’ve seen so far have this kind of unsettling feature.
It's like some kind of weird dream Doom.
You can also notice, in the first part of the video, that the ammo numbers fluctuate a bit randomly.
Small objects like powerups appear and disappear as the player moves (even without backtracking), the ammo count is constantly varying, getting shot doesn't deplete health or armor, etc.
So for the next iteration, they should add a minimap overlay (perhaps on a side channel) - it should help the model give more consistent output in any given location. Right now, the game is very much like a lucid dream - the universe makes sense from moment to moment, but without an outside reference, everything that falls out of short-term memory (a few frames here) gets reimagined.
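A sketch of what that side channel might look like - purely hypothetical, nothing like this is in the paper: render a top-down map from the known layout and stack it onto each frame as an extra conditioning channel.

```python
# Hypothetical "minimap side channel": append a top-down map render as a
# fourth conditioning channel. `render_minimap` is a stub, not a real API.
import numpy as np

def render_minimap(level_map, player_pos, hw):
    """Stub: rasterize known level geometry around the player."""
    return np.zeros(hw + (1,), dtype=np.float32)

def frame_with_minimap(frame_rgb, level_map, player_pos):
    mm = render_minimap(level_map, player_pos, frame_rgb.shape[:2])
    return np.concatenate([frame_rgb, mm], axis=-1)  # (H, W, 4) conditioning
```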
There's an example right at the beginning too - the ammo drop on the right changes to something green (I think that's a body?)
Just want to clarify a couple of possible misconceptions:
The diffusion model doesn't maintain any state itself, though its weights may encode some notion of cause and effect. It just renders one frame at a time (after all, it's a text-to-image model, not text-to-video). Instead of text, the previous states and frames are provided as inputs to the model to predict the next frame.
Noise is added to the previous frames before they are passed into the SD model, so the RL agents were not involved in "correcting" it.
De-noising objectives are widespread in ML; intuitively, they force a predictive model to leverage context, i.e. surrounding frames/words/etc.
In this case it helps prevent auto-regressive drift due to the accumulation of small errors from the randomness inherent in generative diffusion models. Figure 4 shows such drift happening when a player is standing still.
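For intuition, here's the shape of that loop as a minimal sketch. The stub names (`denoise`, `rollout`) and the context length are illustrative assumptions, not the authors' code:

```python
# One frame at a time, conditioned on a window of past frames + actions.
# Gaussian noise on the context frames mirrors the training-time noise
# augmentation that damps autoregressive drift. All names are stubs.
import numpy as np

CONTEXT = 64  # assumed length of the conditioning window

def denoise(ctx_frames, actions):
    """Stub for one conditional diffusion sampling pass."""
    return np.zeros_like(ctx_frames[-1])  # placeholder next frame

def rollout(seed_frames, get_action, steps, noise_std=0.1):
    frames, actions = list(seed_frames), []
    for _ in range(steps):
        actions.append(get_action())
        ctx = [f + np.random.normal(0.0, noise_std, f.shape)
               for f in frames[-CONTEXT:]]  # noise-augmented context
        frames.append(denoise(ctx, actions[-CONTEXT:]))
    return frames
```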
The concept is that you train a diffusion model by feeding it all the possible frames seen in the game.
The training ran over almost 1 billion frames - 20 days of full-time play-time - taking a screenshot of every single inch of the map.
Now you show it N frames as input and ask it "give me frame N+1", and it gives frame N+1 back based on how it was originally seen during training.
But it is not frame N+1 from some mysterious intelligence; it's simply frame N+1 given back from the past dataset.
The drift you mentioned is actually clear (but sad) proof that the model does not invent new frames, and can only spit out an answer from the past dataset.
It's a bit as if you trained Stable Diffusion on Simpsons episodes and it output the next frame of an existing episode that was in the training set, but a few frames later went wild and buggy.
I don't think you've understood the project completely. The model accepts player input, so frame 601 could be quite different if the player decided to turn left rather than right, or chose that moment to fire at an exploding barrel.
1 billion frames in memory... With such a dataset, you have seen practically all realistic possibilities in the short term.
If it were able to invent actions and maps and let the user play "infinite Doom", then it would be very different (and impressive!).
I mean... no? Not even close? Multiplying the number of game states by the number of possible inputs at any given frame gives you a number vastly bigger than 1 billion - not even comparable. Even with 20 days of play time to train on, it's entirely likely that at no point did someone stop at a certain location and look to the left from that exact angle. They might have done so from similar angles, but the model then has to reconstruct some sense of the geometry of the level to synthesize the frame. They might also not have arrived there from the same direction, which again the model needs some smarts to handle.
I get your point - it's very overtrained on these particular levels of Doom, which means you might as well just play Doom. But this is not a hash-table lookup we're talking about; it's pretty impressive work.
This was the basis for the reasoning:
Map 1 has 2'518 walkable map units. There are 65'536 angles.
2'518 * 65'536 = 165'019'648
If you capture 165M frames, you already cover all the possibilities in terms of camera/player view - and the diffusion model probably doesn't even need all the frames (the same way LLMs don't).
I think enemies and effects are probably in there too.
There's also enemy motion, enemy attacks, shooting, and UI considerations, which make the combinatorics explode.
And Doom movement isn't tile-based. The map may be, but you can be in many, many places on a tile.
Do you have to be exactly on a tile in Doom? I thought the guy walked smoothly around the map.
Like many people do with LLMs, you're just demonstrating unawareness of - or disbelief in - the fact that the model doesn't record training data verbatim, but smears it out in a high-dimensional space, from which it then samples. The model doesn't recall past inputs (which are effectively under extreme lossy compression), but samples from that high-dimensional space to produce output. The high-dimensional representation by necessity captures semantic understanding of the training data.
Generating "infinite Doom" is exactly what this model is doing, as it does not capture the larger map layout well enough to stay consistent with it.
I like "conditioned brute force" better term.
Whether or not a judge understands this will probably form the basis of any precedent set about the legality of image models and copyright.
Research is the acquisition of knowledge that may or may not have practical applications.
They succeeded in the research, gained knowledge, and might be able to do something awesome with it.
It’s a success even if they don’t sell anything.
But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".
I would call it the world's least efficient video compression.
What I would like to see is its actual predictive strength - aka imagination - which I did not notice mentioned in the abstract. The model is trained on a set of classic maps. What would it do, given a few frames of gameplay on an unfamiliar map as input? How well could it imagine what happens next?
If it's trained on absolute player coordinates then it would likely just morph into the known map at those coordinates.
But it's trained on the actual screen pixel data, AFAICT. It's literally a visual-imagination model, not a gameplay/geometry-imagination model. They had to make special provisions for the pixel data of the HUD, which by its nature is different from the pictures of a 3D world.
It's not super clear from the landing page, but I think it's an engine? Like, its input is both previous images and input for the next frame.
So as a player, if you press "shoot", the diffusion engine needs to output an image where the monster in front of you takes damage/dies.
How is what you think they say not clear?
"We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality."
It's more like the Tetris Effect, where the model has seen so much Doom that it confabulates gameplay.
They could down-convert the entire model to use only the subset of matrix components from Stable Diffusion. This approach might improve internet bandwidth efficiency, assuming consumers in the future have powerful enough computers.
It's a memory of a video looped to controls: frame 1 is "I wonder how it would look if the player pressed D instead of W", then frame 2 is based on frame 1, etc., and a couple of frames in, it's no longer remembering but imagining the gameplay on the fly. It's not prerecorded; it responds to inputs during generation. That's what makes it a game engine.
No, it’s predicting the next frame conditioned on past frames AND player actions! This is clear from the article. Mere video generation would be nothing new.
Nicely summarised. Another important thing that clearly stands out (not to undermine the effort and work that has gone into this) is that more and more we are seeing larger and more complex building blocks emerge (first it was embedding models, then encoder-decoder layers, and now whole models are being duct-taped together into even more powerful pipelines). The AI/DL ecosystem is growing on a nice trajectory.
Though I wonder if, 10 years down the line, folks won't even care about the underlying model details (no more than a current-day web developer needs to know about network packets).
PS: Not great examples, but I hope you get the idea ;)
A mistake people make all the time is assuming that massive companies will put all their resources behind every project. This paper was written by four co-authors. They probably got a good amount of resources, but they still had to share the pool allocated to their research department.
Even Google only has one Gemini (in a few versions).