Diffusion models are real-time game engines

vessenes
72 replies
15h2m

So, this is surprising. Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected, which would be roughly ‘none’. Google here uses SD 1.4, as the core of the diffusion model, which is a nice reminder that open models are useful to even giant cloud monopolies.

The two main things of note I took away from the summary were: 1) they got infinite training data using agents playing doom (makes sense), and 2) they added Gaussian noise to source frames and rewarded the agent for ‘correcting’ sequential frames back, and said this was critical to get long range stable ‘rendering’ out of the model.

That last is intriguing — they explain the intuition as teaching the model to do error correction / guide it to be stable.

Finally, I wonder if this model would be easy to fine tune for ‘photo realistic’ / ray traced restyling — I’d be super curious to see how hard it would be to get a ‘nicer’ rendering out of this model, treating it as a doom foundation model of sorts.

Anyway, a fun idea that worked! Love those.

wavemode
47 replies
6h32m

Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected

To temper this a bit, you may want to pay close attention to the demo videos. The player rarely backtracks, and for good reason - the few times the character does turn around and look back at something a second time, it has changed significantly (the most noticeable I think is the room with the grey wall and triangle sign).

This falls in line with how we'd expect a diffusion model to behave - it's trained on many billions of frames of gameplay, so it's very good at generating a plausible -next- frame of gameplay based on some previous frames. But it doesn't deeply understand logical gameplay constraints, like remembering level geometry.

dewarrn1
13 replies
5h37m

Great observation. And not entirely unlike normal human visual perception which is notoriously vulnerable to missing highly salient information; I'm reminded of the "gorillas in our midst" work by Dan Simons and Christopher Chabris [0].

[0]: https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...

bamboozled
7 replies
4h21m

Are you saying if I turn around, I’ll be surprised at what I find? I don’t feel like this is accurate at all.

matheusd
2 replies
3h59m

If a generic human glances at an unfamiliar screen/wall/room, can they accurately, pixel-perfectly reconstruct every single element of it? Can they do it for every single screen they have seen in their entire lives?

bamboozled
1 replies
3h47m

I never said pixel perfect, but I would be surprised if whole objects, like flaming lanterns, suddenly appeared.

What this demo demonstrates to me is how incredibly willing we are to accept what seems familiar to us as accurate.

I bet if you look closely and objectively you will see even more anomalies. But at first watch, I didn’t see most errors because I think accepting something is more efficient for the brain.

ben_w
0 replies
2h6m

You'd likely be surprised by a flaming lantern unless you were in Flaming Lanterns 'R Us, but if you were watching a video of a card trick and the two participants changed clothes while the camera wasn't focused on them, you may well miss that and the other five changes that came with that.

dewarrn1
2 replies
3h39m

Not exactly, but our representation of what's behind us is a lot more sparse than we would assume. That is, I might not be surprised by what I see when I turn around, but it could have changed pretty radically since I last looked, and I might not notice. In fact, an observer might be quite surprised that I missed the change.

Objectively, Simons and Chabris (and many others) have a lot of data to support these ideas. Subjectively, I can say that these types of tasks (inattentional blindness, change blindness, etc.) are humbling.

jerf
1 replies
2h36m

Well, it's a bit of a spoiler to encounter this video in this context, but this is a very good video: https://www.youtube.com/watch?v=LRFMuGBP15U

Even having a clue why I'm linking this, I virtually guarantee you won't catch everything.

And even if you do catch everything... the real thing to notice is that you had to look. Your brain does not flag these things naturally. Dreams are notorious for this sort of thing, but even in the waking world your model of the world is much less rich than you think. Magic tricks like to hide in this space, for instance.

ajuc
0 replies
3h21m

The opposite - if you turn around and there's something that wasn't there the last time - you'll likely not notice if it's not out of place. You'll just assume it was there and you weren't paying attention.

We don't memorize things that the environment remembers for us if they aren't relevant for other reasons.

throwway_278314
1 replies
1h45m

Work which exaggerates the blindness.

The people were told to focus very deeply on a certain aspect of the scene. Maintaining that focus means explicitly blocking things not related to that focus. Also, there is social pressure at the end to have performed well at the task; evaluating them on a task which is intentionally completely different from the one explicitly given is going to bias people away from reporting gorillas.

And also, "notice anything unusual" is a pretty vague prompt. No-one in the video thought the gorillas were unusual, so if the PEOPLE IN THE SCENE thought gorillas were normal, why would I think they were strange? Look at any TV show, they are all full of things which are pretty crazy unusual in normal life, yet not unusual in terms of the plot.

Why would you think the gorillas were unusual?

dewarrn1
0 replies
1h30m

I understand what you mean. I believe that the authors would contend that what you're describing is a typical attentional state for an awake/aware human: focused mostly on one thing, and with surprisingly little awareness of most other things (until/unless they are in turn attended).

Furthermore, even what we attend to isn't always represented with all that much detail. Simons has a whole series of cool demonstration experiments where they show that they can swap out someone you're speaking with (an unfamiliar conversational partner like a store clerk or someone asking for directions), and you may not even notice [0]. It's rather eerie.

[0]: https://www.youtube.com/watch?v=FWSxSQsspiQ&t=5s

robotresearcher
1 replies
4h3m

Not noticing a gorilla that ‘shouldn’t’ be there is not the same thing as object permanence. Even quite young babies are surprised by objects that go missing.

dewarrn1
0 replies
3h44m

That's absolutely true. It's also well-established by Simons et al. and others that healthy normal adults maintain only a very sparse visual representation of their surroundings, anchored but not perfectly predicted by attention, and this drives the unattended gorilla phenomenon (along with many others). I don't work in this domain, but I would suggest that object permanence probably starts with attending and perceiving an object, whereas the inattentional or change blindness phenomena mostly (but not exclusively) occur when an object is not attended (or only briefly attended) or attention is divided by some competing task.

lawlessone
0 replies
4h20m

It reminds me of dreaming. When you do something and turn back to check, it has turned into something completely different.

edit: someone should train it on MyHouse.wad

nmstoker
9 replies
5h30m

I saw a longer video of this that Ethan Mollick posted and in that one, the sequences are longer and they do appear to demonstrate a fair amount of consistency. The clips don't backtrack in the summary video on the paper's home page because they're showing a number of distinct environments but you only get a few seconds of each.

If I studied the longer one more closely, I'm sure inconsistencies would be seen but it seemed able to recall presence/absence of destroyed items, dead monsters etc on subsequent loops around a central obstruction that completely obscured them for quite a while. This did seem pretty odd to me, as I expected it to match how you'd described it.

wavemode
8 replies
5h14m

Yes it definitely is very good for simulating gameplay footage, don't get me wrong. Its input for predicting the next frame is not just the previous frame, it has access to a whole sequence of prior frames.

But to say the model is simulating actual gameplay (i.e. that a person could actually play Doom in this) is far fetched. It's definitely great that the model was able to remember that the gray wall was still there after we turned around, but it's untenable for actual gameplay that the wall completely changed location and orientation.

TeMPOraL
6 replies
4h29m

it's untenable for actual gameplay that the wall completely changed location and orientation.

It would in an SCP-themed game. Or dreamscape/Inception themed one.

Hell, "you're trapped in a Doom-like dreamscape, escape before you lose your mind" is a very interesting pitch for a game. Basically take this Doom thing and make walking through a specific, unique-looking doorway from the original game the victory condition - the player's job would be to coerce the model to generate it, while also not dying in the Doom fever dream game itself. I'd play the hell out of this.

(Implementation-wise, just loop in a simple recognition model to continuously evaluate the victory condition from the last few frames, and some OCR to detect when the player's hit points indicator on the HUD drops to zero.)

(I'll happily pay $100 this year to the first project that gets this to work. I bet I'm not the only one. Doesn't have to be Doom specifically, just has to be interesting.)
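
(Rough sketch of what that outer loop could look like, for anyone tempted; the model, classifier, and OCR calls here are placeholders I made up, not a real API:)

  # Hypothetical loop: the diffusion model keeps generating frames, a
  # separate classifier watches for the "victory doorway", and OCR reads
  # the HUD health counter. All names are illustrative.
  from collections import deque

  def dream_doom_loop(model, doorway_classifier, hud_ocr, get_input, show):
      history = deque(maxlen=64)            # last N frames fed back as context
      while True:
          action = get_input()              # keyboard/mouse state this tick
          frame = model.next_frame(action, list(history))
          history.append(frame)
          show(frame)
          if doorway_classifier.matches(list(history)[-8:]):
              return "escaped the dream"
          if hud_ocr.read_health(frame) <= 0:
              return "lost to the fever dream"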

wavemode
3 replies
4h6m

To be honest, I agree! That would be an interesting gameplay concept for sure.

Mainly just wanted to temper expectations I'm seeing throughout this thread that the model is actually simulating Doom. I don't know what will be required to get from here to there, but we're definitely not there yet.

ValentinA23
1 replies
3h54m

What you're pointing at mirrors the same kind of limitation in using LLMs for role-play/interactive fictions.

lawlessone
0 replies
2h2m

Maybe a hybrid approach would work. Certain things like inventory being stored as variables, lists etc.

Wouldn't be as pure though.
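
(Roughly the hybrid I mean, as a toy sketch; `renderer.next_frame` stands in for the neural part and everything else is ordinary code:)

  # Toy hybrid: hard game state lives in plain code, and only rendering is
  # delegated to a generative model. Names are illustrative.
  from dataclasses import dataclass, field

  @dataclass
  class GameState:
      health: int = 100
      ammo: int = 50
      inventory: list = field(default_factory=list)

  def step(state: GameState, action: str, renderer, history):
      if action == "pickup_medkit":
          state.inventory.append("medkit")
          state.health = min(100, state.health + 25)
      elif action == "fire" and state.ammo > 0:
          state.ammo -= 1
      # the model only draws the scene; the HUD values are authoritative
      frame = renderer.next_frame(action, history, hud=(state.health, state.ammo))
      return state, frame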

KajMagnus
0 replies
3h45m

Or if training the model on many FPS games? Surviving in one nightmare that morphs into another, into another, into another ...

kridsdale1
1 replies
2h57m

Check out the actual modern DOOM WAD MyHouse which implements these ideas. It totally breaks our preconceptions of what the DOOM engine is capable of.

https://en.wikipedia.org/wiki/MyHouse.wad

jsheard
0 replies
1h22m

MyHouse is excellent, but it mostly breaks our perception of what the Doom engine is capable of by not really using the Doom engine. It leans heavily on engine features which were embellishments by the GZDoom project, and never existed in the original Doom codebase.

dr_dshiv
0 replies
4h54m

It's an empirical question, right? But they didn't do it...

whiteboardr
6 replies
5h47m

But does it need to be frame-based?

What if you combine this with an engine in parallel that provides all geometry including characters and objects with their respective behavior, recording changes made through interactions the other model generates, talking back to it?

A dialogue between two parties with different functionality so to speak.

(Non technical person here - just fantasizing)

robotresearcher
2 replies
3h58m

In that scheme what is the NN providing that a classical renderer would not? DOOM ran great on an Intel 486, which is not a lot of computer.

whiteboardr
0 replies
3h53m

An experience that isn’t asset- but rule-based.

Sohcahtoa82
0 replies
2h13m

DOOM ran great on an Intel 486

It always blew my mind how well it worked on a 33 MHz 486. I'm fairly sure it ran at 30 fps in 320x200. That gives it just over 17 clock cycles per pixel, and that doesn't even include time for game logic.

My memory could be wrong, though, but even if it required a 66 MHz to reach 30 fps, that's still only 34 clocks per pixel on an architecture that required multiple clocks for a simple integer add instruction.
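
(Quick check of those numbers:)

  # Cycles available per pixel at 320x200, 30 fps (ignoring game logic).
  pixels_per_second = 320 * 200 * 30        # 1,920,000
  print(33_000_000 / pixels_per_second)     # ~17.2 cycles/pixel on a 33 MHz 486
  print(66_000_000 / pixels_per_second)     # ~34.4 cycles/pixel on a 66 MHz 486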

beepbooptheory
1 replies
5h4m

What would the model provide if not what we see on the screen?

whiteboardr
0 replies
4h24m

The environment and everything in it.

“Everything” would mean all objects and the elements they’re made of, their rules on how they interact and decay.

A modularized ecosystem, I guess, comprised of “sub-systems” of sorts.

The other model, that provides all interaction (cause for effect) could either be run artificially or be used interactively by a human - opening up the possibility for being a tree : )

This all would need an interfacing agent that in principle would be an engine simulating the second law of thermodynamics and at the same time recording every state that has changed and diverged off the driving actor’s vector in time.

Basically the “effects” model keeping track of everyone’s history.

In the end a system with an “everything” model (that can grow overtime), a “cause” model messing with it, brought together and documented by the “effect” model.

(Again … non technical person, just fantasizing) : )

bee_rider
0 replies
5h19m

In that case, the title of the article wouldn’t be true anymore. It seems like a better plan, though.

Workaccount2
4 replies
4h36m

I don't see this as something that would be hard to overcome. Sora for instance has already shown the ability for a diffusion model to maintain object permanence. Flux recently too has shown the ability to render the same person in many different poses or images.

idunnoman1222
2 replies
4h27m

Where does a sora video turn around backwards? I can’t maintain such consistency in my own dreams.

Workaccount2
1 replies
3h34m

I don't know of an example (not to say it doesn't exist) but the problem is fundamentally the same as things moving out of sight/out of frame and coming back again.

Jensson
0 replies
1h55m

the problem is fundamentally the same as things moving out of sight/out of frame and coming back again

Maybe it is, but doing that with the entire scene instead of just a small part of it makes the problem massively harder, as the model needs to grow exponentially to remember more things. It isn't something that we will manage anytime soon, maybe 10-20 years with current architecture and same compute progress.

Then you make that even harder by remembering a whole game level? No, ain't gonna happen in our lifetimes without massive changes to the architecture. They would need to make a different model keep track of level state etc, not just an image to image model.

idunnoman1222
0 replies
4h30m

Where does a sora video turn around backwards? I don’t even maintain such consistency in my dreams.

alickz
2 replies
5h33m

is that something that can be solved with more memory/attention/context?

or do we believe it's an inherent limitation in the approach?

noiv
1 replies
5h25m

I think the real question is does the player get shot from behind?

alickz
0 replies
4h50m

great question

tangentially related but Grand Theft Auto speedrunners often point the camera behind them while driving so cars don't spawn "behind" them (aka in front of the car)

mensetmanusman
1 replies
6h17m

That is kind of cool though, I would play it, like being lost in a dream.

If on the backend you could record the level layouts in memory you could have exploration teams that try to find new areas to explore.

debo_
0 replies
5h56m

It would be cool for dream sequences in games to feel more like dreams. This is probably an expensive way to do it, but it would be neat!

codeflo
1 replies
5h59m

Even purely going forward, specks on wall textures morph into opponents and so on. All the diffusion-generated videos I’ve seen so far have this kind of unsettling feature.

bee_rider
0 replies
5h17m

It is like some kind of weird dream Doom.

nielsbot
0 replies
57m

You can also notice in the first part of the video the ammo numbers fluctuate a bit randomly.

hoosieree
0 replies
5h20m

Small objects like powerups appear and disappear as the player moves (even without backtracking), the ammo count is constantly varying, getting shot doesn't deplete health or armor, etc.

TeMPOraL
0 replies
5h5m

So for the next iteration, they should add a minimap overlay (perhaps on a side channel) - it should help the model give more consistent output in any given location. Right now, the game is very much like a lucid dream - the universe makes sense from moment to moment, but without outside reference, everything that falls out of short-term memory (few frames here) gets reimagined.
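
(If someone tried it, the side channel could be as simple as a few extra input channels to the denoiser, e.g. a coarse top-down map resized to the latent resolution. A sketch of the idea only, nothing from the paper:)

  # Hypothetical conditioning: concatenate a minimap as extra channels
  # alongside the noisy latent and the encoded past frames.
  import torch
  import torch.nn.functional as F

  def build_denoiser_input(noisy_latent, past_frame_latents, minimap):
      # noisy_latent: (B, C, h, w); past_frame_latents: (B, N*C, h, w)
      # minimap: (B, 1, H, W) coarse top-down map of the current level
      minimap_small = F.interpolate(minimap, size=noisy_latent.shape[-2:],
                                    mode="nearest")
      return torch.cat([noisy_latent, past_frame_latents, minimap_small], dim=1)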

Groxx
0 replies
6h6m

There's an example right at the beginning too - the ammo drop on the right changes to something green (I think that's a body?)

refibrillator
12 replies
13h5m

Just want to clarify a couple possible misconceptions:

The diffusion model doesn’t maintain any state itself, though its weights may encode some notion of cause/effect. It just renders one frame at a time (after all it’s a text to image model, not text to video). Instead of text, the previous frames and actions are provided as inputs to the model to predict the next frame.

Noise is added to the previous frames before being passed into the SD model, so the RL agents were not involved with “correcting” it.

De-noising objectives are widespread in ML, intuitively it forces a predictive model to leverage context, ie surrounding frames/words/etc.

In this case it helps prevent auto-regressive drift due to the accumulation of small errors from the randomness inherent in generative diffusion models. Figure 4 shows such drift happening when a player is standing still.
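
(For the curious, the noise-augmentation trick looks roughly like this on the training side; a sketch of the idea, not the authors' code:)

  # Corrupt the context frames with a random amount of Gaussian noise and
  # tell the model how much was added, so at inference time it can cope
  # with its own slightly-off previous outputs instead of drifting.
  import torch

  def corrupt_context(context_latents, max_level=0.7):
      b = context_latents.shape[0]
      level = torch.rand(b, 1, 1, 1) * max_level
      noised = context_latents + level * torch.randn_like(context_latents)
      return noised, level.flatten()   # the level is also fed in as an embedding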

rvnx
11 replies
6h31m

The concept is that you train a diffusion model by feeding it all the possible frames seen in the game.

The training was over almost 1 billion frames, 20 days of full-time play-time, taking a screenshot of every single inch of the map.

Now you show it N frames as input and ask it "give me frame N+1", and it gives you frame N+1 back based on how it was originally seen during training.

But it is not frame N+1 from a mysterious intelligence, it's simply frame N+1 given back from the past database.

The drift you mentioned is actually clear (but sad) proof that the model does not work at inventing new frames, and can only spit out an answer from the past dataset.

It's a bit like training stable diffusion on Simpsons episodes: it outputs the next frame of an existing episode that was in the training set, but a few frames later it goes wild and buggy.

jetrink
9 replies
6h19m

I don't think you've understood the project completely. The model accepts player input, so frame 601 could be quite different if the player decided to turn left rather than right, or chose that moment to fire at an exploding barrel.

rvnx
8 replies
6h11m

1 billion frames in memory... With such a dataset, you have seen practically all realistic possibilities in the short term.

If it were able to invent actions and maps and let the user play "infinite doom", then it would be very different (and impressive!).

OskarS
4 replies
5h46m

1 billion frames in memory... With such dataset, you have seen practically all realistic possibilities in the short-term.

I mean... no? Not even close? Multiplying the number of game states by the number of possible inputs at any given frame gives you a number vastly bigger than 1 billion, not even comparable. Even with 20 days of play time to train on, it's entirely likely that at no point did someone stop at a certain location and look to the left from that angle. They might have done so from similar angles, but the model then has to reconstruct some sense of the geometry of the level to synthesize the frame. They might also not have arrived there from the same direction, which again the model needs some smarts to understand.

I get your point, it's very overtrained on these particular levels of Doom, which means you might as well just play Doom. But this is not a hash table lookup we're talking about, it's pretty impressive work.

rvnx
3 replies
5h26m

This was the basis for the reasoning:

Map 1 has 2,518 walkable map units. There are 65,536 angles.

2,518 * 65,536 = 165,019,648

If you capture 165M frames, you already cover all the possibilities in terms of camera / player view, but probably the diffusion models don't even need to have all the frames (the same way that LLMs don't).

znx_0
0 replies
5h19m

I think enemy and effects are probably in there

commodoreboxer
0 replies
4h54m

There's also enemy motion, enemy attacks, shooting, and UI considerations, which make the combinatorics explode.

And Doom movement isn't tile based. The map may be, but you can be in many many places on a tile.

bee_rider
0 replies
5h13m

Do you have to be exactly on a tile in Doom? I thought the guy walked smoothly around the map.

TeMPOraL
2 replies
4h41m

Like many people in the case of LLMs, you're just demonstrating unawareness of - or disbelief in - the fact that the model doesn't record training data verbatim, but smears it out in high-dimensional space, from which it then samples. The model then doesn't recall past inputs (which are effectively under extreme lossy compression), but samples from that high-dimensional space to produce output. The high-dimensional representation by necessity captures semantic understanding of the training data.

Generating "infinite Doom" is exactly what this model is doing, as it does not capture the larger map layout well enough to stay consistent with it.

znx_0
0 replies
3h57m

I think "conditioned brute force" is a better term.

Workaccount2
0 replies
4h25m

Whether or not a judge understands this will probably form the basis of any precedent set about the legality of image models and copyright.

mensetmanusman
0 replies
6h15m

Research is the acquisition of knowledge that may or may not have practical applications.

They succeeded in the research, gained knowledge, and might be able to do something awesome with it.

It’s a success even if they don’t sell anything.

nine_k
8 replies
9h24m

But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".

I would call it the world's least efficient video compression.

What I would like to see is the actual predictive strength, aka imagination, which I did not notice mentioned in the abstract. The model is trained on a set of classic maps. What would it do, given a few frames of gameplay on an unfamiliar map as input? How well could it imagine what happens next?

WithinReason
1 replies
9h17m

If it's trained on absolute player coordinates then it would likely just morph into the known map at those coordinates.

nine_k
0 replies
9h10m

But it's trained on the actual screen pixel data, AFAICT. It's literally a visual imagination model, not a gameplay / geometry imagination model. They had to make special provisions for the pixel data on the HUD, which by its nature is different from the pictures of a 3D world.

PoignardAzur
1 replies
8h16m

But it's not a game. It's a memory of a game video, predicting the next frame based on the few previous frames, like "I can imagine what happened next".

It's not super clear from the landing page, but I think it's an engine? Like, its input is both previous images and input for the next frame.

So as a player, if you press "shoot", the diffusion engine needs to output an image where the monster in front of you takes damage/dies.

bergen
0 replies
6h24m

How is what you think they say not clear?

We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.

taneq
0 replies
6h59m

It's more like the Tetris Effect, where the model has seen so much Doom that it confabulates gameplay.

mensetmanusman
0 replies
6h12m

They could down convert the entire model to only utilize the subset of matrix components from stable diffusion. This approach may be able to improve internet bandwidth efficiency assuming consumers in the future have powerful enough computers.

TeMPOraL
0 replies
4h20m

It's a memory of a video looped to controls, so frame 1 is "I wonder how it would look if the player pressed D instead of W", then frame 2 is based on frame 1, etc., and a couple frames in, it's already not remembering but imagining the gameplay on the fly. It's not prerecorded, it responds to inputs during generation. That's what makes it a game engine.

Sharlin
0 replies
5h47m

No, it’s predicting the next frame conditioned on past frames AND player actions! This is clear from the article. Mere video generation would be nothing new.

raghavbali
0 replies
4h28m

Nicely summarised. Another important thing that clearly stands out (not to undermine the efforts and work gone into this) is the fact that more and more we are now seeing larger and more complex building blocks emerging (first it was embedding models, then encoder-decoder layers, and now whole models are being duct-taped together into ever more powerful pipelines). The AI/DL ecosystem is growing on a nice trajectory.

Though I wonder if 10 years down the line folks wouldn't even care about underlying model details (no more than a current day web-developer needs to know about network packets).

PS: Not great examples, but I hope you get the idea ;)

pradn
0 replies
3h8m

Google here uses SD 1.4, as the core of the diffusion model, which is a nice reminder that open models are useful to even giant cloud monopolies.

A mistake people make all the time is that massive companies will put all their resources toward every project. This paper was written by four co-authors. They probably got a good amount of resources, but they still had to share in the pool allocated to their research department.

Even Google only has one Gemini (in a few versions).

wkcheng
35 replies
14h49m

It's insane that this works, and that it works fast enough to render at 20 fps. It seems like they almost made a cross between a diffusion model and an RNN, since they had to encode the previous frames and actions and feed them into the model at each step.

Abstractly, it's like the model is dreaming of a game that it played a lot of, and real time inputs just change the state of the dream. It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.
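
(My mental model of that loop, sketched out; `encode` and `denoise` are placeholders for the image encoder and the conditioned diffusion step, and the window length is illustrative rather than the paper's number:)

  # The diffusion model itself is stateless; all "memory" is the sliding
  # window of recent frames and actions fed in as conditioning each step.
  from collections import deque

  WINDOW = 64
  frames = deque(maxlen=WINDOW)
  actions = deque(maxlen=WINDOW)

  def next_frame(model, encoder, player_action):
      actions.append(player_action)
      context = [encoder.encode(f) for f in frames]      # latents of past frames
      frame = model.denoise(context=context, actions=list(actions))
      frames.append(frame)
      return frame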

slashdave
14 replies
13h20m

Image is 2D. Video is 3D. The mathematical extension is obvious. In this case, low resolution 2D (pixels), and the third dimension is just frame rate (discrete steps). So rather simple.

Sharlin
12 replies
13h16m

This is not "just" video, however. It's interactive in real time. Sure, you can say that playing is simply video with some extra parameters thrown in to encode player input, but still.

slashdave
11 replies
12h59m

It is just video. There are no external interactions.

Heck, it is far simpler than video, because the point of view and frame is fixed.

SeanAnderson
8 replies
12h34m

I think you're mistaken. The abstract says it's interactive, "We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction"

Further - "a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions." specifically "and actions"

User input is being fed into this system and subsequent frames take that into account. The user is "actually" firing a gun.

slashdave
2 replies
2h48m

No, I am not. The interaction is part of the training, and is used during inference, but it is not included during the process of generation.

SeanAnderson
1 replies
2h41m

Okay, I think you're right. My mistake. I read through the paper more closely and I found the abstract to be a bit misleading compared to the contents. Sorry.

slashdave
0 replies
1h32m

Don't worry. The paper is not very well written.

smusamashah
1 replies
8h2m

It's interactive, but can it go beyond what it learned from the videos? As in, can the camera break free and roam around the map from different angles? I don't think it will be able to do that at all. There are still a few hallucinations in this rendering; it doesn't look like it understands 3D.

Sharlin
0 replies
5h26m

You might be surprised. Generating views from novel angles based on a single image is not novel, and if anything, this model has more than a single frame as input. I’d wager that it’s quite able to extrapolate DOOM-like corridors and rooms even if it hasn’t seen the exact place during training. And sure, it’s imperfect but on the other hand it works in real time on a single TPU.

hypertele-Xii
1 replies
6h18m

Then why do monsters become blurry, smudgy messes when shot? That looks like a video compression artifact of a neural network attempting to replicate a low-structure image (the source material contains guts exploding, a very unstructured visual).

Sharlin
0 replies
5h53m

Uh, maybe because monster death animations make up a small part of the training material (i.e. gameplay), so the model has not learned to reproduce them very well?

There cannot be "video compression artifacts" because it hasn’t even seen any compressed video during training, as far as I can see.

Seriously, how is this even a discussion? The article is clear that the novel thing is that this is real-time frame generation conditioned on the previous frame(s) AND player actions. Just generating video would be nothing new.

nopakos
0 replies
11h41m

Maybe it's so advanced, it knows the players' next moves, so it is a video!

raincole
1 replies
12h39m

?

I highly suggest you read the paper briefly before commenting on the topic. The whole point is that it's not just generating a video.

slashdave
0 replies
2h48m

I did. It is generating a video, using latent information on player actions during the process (which it also predicts). It is not interactive.

InDubioProRubio
0 replies
10h49m

Video is also higher resolution, as the pixels flip for the high-resolution world as you move through it. Swivelling your head without glasses, even the blurry world contains more information in the curve of pixel change.

lokimedes
8 replies
13h37m

It makes good sense for humans to have this ability. If we flip the argument, and see the next frame as a hypothesis for what is expected as the outcome of the current frame, then comparing this "hypothesis" with what is sensed makes it easier to process the differences, rather than the totality of the sensory input.

As Richard Dawkins recently put it in a podcast[1], our genes are great prediction machines, as their continued survival rests on it. Being able to generate a visual prediction fits perfectly with the amount of resources we dedicate to sight.

If that is the case, what does aphantasia tell us?

[1] https://podcasts.apple.com/dk/podcast/into-the-impossible-wi...

dbspin
4 replies
9h46m

Worth noting that aphantasia doesn't necessarily extend to dreams. Anecdotally - I have pretty severe aphantasia (I can conjure millisecond glimpses of barely tangible imagery that I can't quite perceive before it's gone - but only since learning that visualisation wasn't a linguistic metaphor). I can't really simulate object rotation. I can't really 'picture' how things will look before they're drawn / built etc. However I often have highly vivid dream imagery. I also have excellent recognition of faces and places (e.g.: can't get lost in a new city). So there clearly is a lot of preconscious visualisation and image matching going on in some aphantasia cases, even where the explicit visual screen is all but absent.

lokimedes
2 replies
7h44m

I fabulate about this in another comment below:

Many people with aphantasia report being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the [aphantasia] brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".

(I obviously don't know what I'm talking about, just a fellow aphant)

drowsspa
0 replies
4h11m

Yeah. In my head it's like I'm manipulating SVG paths instead of raw pixels

dbspin
0 replies
5h10m

Obviously we're all introspecting here - but my guess is that there's some kind of cross talk in aphantasic brains between the conscious narrating semantic brain and the visual module. Such that default mode visualisation is impaired. It's specifically the loss of reflexive consciousness that allows visuals to emerge. Not sure if this is related, but I have pretty severe chronic insomnia, and I often wonder if this in part relates to the inability to drift off into imagery.

zimpenfish
0 replies
8h21m

Pretty much the same for me. My aphantasia is total (no images at all) but still ludicrously vivid dreams and not too bad at recognising people and places.

jonplackett
1 replies
9h59m

What’s the aphantasia link? I’ve got aphantasia. I’m convinced though that the bit of my brain that should be making images is used for letting me ‘see’ how things are connected together very easily in my head. Also I still love games like Pictionary and can somehow draw things onto paper that I don’t really know what they look like in my head. It’s often a surprise when pen meets paper.

lokimedes
0 replies
9h48m

I agree, it is my own experience as well. Craig Venter, in one of his books, also credits this way of representing knowledge as abstractions as his strength in inventing new concepts.

The link may be that we actually see differences between “frames”, rather than the frames directly. That in itself would imply that a form of sub-visual representation is being processed by our brain. For aphantasia, it could be that we work directly on this representation instead of recalling imagery through the visual system.

Many people with aphantasia report being able to visualize in their dreams, meaning that they don't lack the ability to generate visuals. So it may be that the brain has an affinity to rely on the abstract representation when "thinking", while dreaming still uses the "stable diffusion mode".

I’m nowhere near qualified to speak of this with certainty, but it seems plausible to me.

quickestpoint
0 replies
11h33m

"As Richard Dawkins theorized" would be more accurate and less LLM-like :)

quickestpoint
1 replies
11h38m

Umm, that’s a theory.

mind-blight
0 replies
5h41m

So are gravity and friction. I don't know how well tested or accepted it is, but being just a theory doesn't tell you much about how true it is without more info

nsbk
1 replies
7h23m

We are. At least that's what Lisa Feldman Barrett [1] thinks. It is worth listening to this Lex Fridman podcast: Counterintuitive Ideas About How the Brain Works [2], where she explains among other ideas how constant prediction is the most efficient way of running a brain as opposed to reaction. I never get tired of listening to her, she's such a great science communicator.

[1] https://en.wikipedia.org/wiki/Lisa_Feldman_Barrett

[2] https://www.youtube.com/watch?v=NbdRIVCBqNI&t=1443s

PunchTornado
0 replies
3h46m

Interesting talk about the brain, but the stuff she says about free will is not a very good argument. Basically it is sort of the argument that the ancient Greeks made, which brings the discussion to a point where you can take both directions.

mensetmanusman
1 replies
6h9m

Penrose (Nobel prize in physics) stipulates that quantum effects in the brain may allow a certain amount of time travel and back propagation to accomplish this.

wrsh07
0 replies
4h10m

You don't need back propagation to learn

This is an incredibly complex hypothesis that doesn't really seem justified by the evidence

wrsh07
0 replies
4h9m

Makes me wonder when an update to the world models paper comes out where they drop in diffusion models: https://worldmodels.github.io/

richard___
0 replies
12h10m

Did they take in the entire history as context?

dartos
0 replies
3h12m

It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.

This, to me, seems extremely reductionist. Like you start with AI and work backwards until you frame all cognition as next something predictors.

It’s just the stochastic parrot argument again.

Teever
0 replies
13h39m

Also recursion and nested virtualization. We can dream about dreaming and imagine different scenarios, some completely fictional or simply possible future scenarios all while doing day to day stuff.

danjl
24 replies
14h57m

So, diffusion models are game engines as long as you already built the game? You need the game to train the model. Chicken. Egg?

kragen
16 replies
14h48m

here are some ideas:

- you could build a non-real-time version of the game engine and use the neural net as a real-time approximation

- you could edit videos shot in real life to have huds or whatever and train the neural net to simulate reality rather than doom. (this paper used 900 million frames which i think is about a year of video if it's 30fps, but maybe algorithmic improvements can cut the training requirements down) and a year of video isn't actually all that much—like, maybe you could recruit 500 people to play paintball while wearing gopro cameras with accelerometers and gyros on their heads and paintball guns, so that you could get a year of video in a weekend?
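
(checking that estimate, assuming 30fps:)

  # 900 million frames at 30 fps is indeed very close to a year of video.
  frames = 900_000_000
  seconds = frames / 30
  print(seconds / 86_400)   # ~347 days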

w_for_wumbo
9 replies
14h30m

That feels like the endgame of video game generation. You select an art style, a video and the type of game you'd like to play. The game is then generated in real-time responding to each action with respect to the existing rule engine.

I imagine a game like that could get so convincing in its details and immersiveness that one could forget they're playing a game.

aithrowaway1987
2 replies
11h43m

Have you ever played a video game? This is unbelievably depressing. This is a future where games like Slay the Spire, with a unique art style and innovative gameplay, simply are not being made.

Not to mention this childish nonsense about "forget they're playing a game," as if every game needs to be lifelike VR and there's no room for stylization or imagination. I am worried for the future that people think they want these things.

idiotsecant
0 replies
5h26m

It's a good thing. When the printing press was invented, there were probably monks and scribes who thought that this new mechanical monster that took all the individual flourish out of reading was the end of literature. Instead it became a tool to make literature better and just removed a lot of drudgery. Games with individual style and design made by people will of course still exist. They'll just be easier to make.

Workaccount2
0 replies
3h35m

The problem is quite the opposite: AI will be able to generate so many games with so many play styles that it will totally dilute the value of all games.

Compare it to music-gen algos that can now produce music that is 100% indiscernible from generic crappy music. Which is insane given that 5 years ago it could maybe create the sound of something that someone might describe as "sort of guitar-like". At this rate of progress it's probably not going to be long before AI is making better music than humans. And it's infinitely available too.

numpad0
1 replies
13h47m

IIRC, both 2001 (1968) and Solaris (1972) depict that kind of thing as part of an alien euthanasia process, not as happy endings.

hypertele-Xii
0 replies
6h13m

Also The Matrix, Oblivion, etc.

THBC
1 replies
14h18m

Holodeck is just around the corner

amelius
0 replies
9h23m

Except for haptics.

troupo
0 replies
11h35m

There are thousands of games that mimic each other, and only a handful of them are any good.

What makes you think a mechanical "predict next frame based on existing games" will be any good?

omegaworks
0 replies
14h15m

EXISTENZ IS PAUSED!

injidup
4 replies
12h27m

Why games? I will train it on 1 year's worth of me attending Microsoft Teams meetings. Then I will go surfing.

kqr
1 replies
7h31m

Even if you spend 40 hours a week in video conferences, you'll have to work for over four years to get one years' worth of footage. Of course, by then the models will be even better and so you might actually have a chance of going surfing.

I guess I should start hoarding video of myself now.

kragen
0 replies
6h23m

the neural net doesn't need a year of video to train to simulate your face; it can do that from a single photo. the year of video is to learn how to play the game, and in most cases lots of people are playing the same game, so you can dump all their video in the same training set

ccozan
0 replies
7h38m

most underrated comment here!

akie
0 replies
9h11m

Ready to pay for this

qznc
0 replies
12h11m

The Cloud Gaming platforms could record things for training data.

modeless
2 replies
13h26m

If you train it on multiple games then you could produce new games that have never existed before, in the same way image generation models can produce new images that have never existed before.

lewhoo
0 replies
9h45m

From what I understand that could make the engine much less stable. The key here is repetitiveness.

jsheard
0 replies
7h39m

It's unlikely that such a procedurally generated mashup would be perfectly coherent, stable and most importantly fun right out of the gate, so you would need some way to reach into the guts of the generated game and refine it. If properties as simple as "how much health this enemy type has" are scattered across an enormous inscrutable neural network, and may not even have a single consistent definition in all contexts, that's going to be quite a challenge. Nevermind if the game just catastrophically implodes and you have to "debug" the model.

slashdave
0 replies
13h19m

Well, yeah. Image diffusion models only work because you can provide large amounts of training data. For Doom it is even simpler, since you don't need to deal with compositing.

passion__desire
0 replies
12h54m

Maybe, in future, techniques of Scientific Machine Learning which can encode physics and other known laws into a model would form a base model. And then other models on top could just fine tune aspects to customise a game.

billconan
0 replies
14h55m

maybe the next step is adding text guidance and generating non-existing games.

attilakun
0 replies
13h15m

If only there was a rich 3-dimensional physical environment we could draw training data from.

refibrillator
20 replies
14h32m

There is no text conditioning provided to the SD model because they removed it, but one can imagine a near future where text prompts are enough to create a fun new game!

Yes they had to use RL to learn what DOOM looks like and how it works, but this doesn’t necessarily pose a chicken vs egg problem. In the same way that LLMs can write a novel story, despite only being trained on existing text.

IMO one of the biggest challenges with this approach will be open world games with essentially an infinite number of possible states. The paper mentions that they had trouble getting RL agents to completely explore every nook and corner of DOOM. Factorio or Dwarf Fortress probably won’t be simulated anytime soon…I think.

mlsu
10 replies
13h42m

With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.

At which point, you effectively would be interpolating in latent space through the source code to actually "render" the game. You'd have an entire latent space computer, with an engine, assets, textures, a software renderer.

With a sufficiently powerful computer, one could imagine what interpolating in this latent space between, say Factorio and TF2 (2 of my favorites). And tweaking this latent space to your liking by conditioning it on any number of gameplay aspects.

This future comes very quickly for subsets of the pipeline, like the very end stage of rendering -- DLSS is already in production, for example. Maybe Nvidia's revenue wraps back to gaming once again, as we all become bolted into a neural metaverse.

God I love that they chose DOOM.

energy123
3 replies
13h26m

The source code lacks information required to render the game. Textures for example.

TeMPOraL
1 replies
12h3m

Obviously assets would get encoded too, in some form. Not necessarily corresponding to the original bitmaps, if the game does some consistent post-processing, the encoded thing would more likely be (equivalent to) the post-processed state.

hoseja
0 replies
11h38m

Finally, the AI superoptimizing compiler.

mistercheph
0 replies
10h59m

That’s just an artifact of the language we use to describe an implementation detail; in the sense GP means it, the data payload bits are not essentially distinct from the executable instruction bits.

godelski
2 replies
9h35m

  > With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. 
You and I have very different definitions of compression

https://news.ycombinator.com/item?id=41377398

  > Someone in the field could probably correct me on that.
^__^

_hark
1 replies
7h37m

The raw capacity of the network doesn't tell you how complex the weights actually are. The capacity is only an upper bound on the complexity.

It's easy to see this by noting that you can often prune networks quite a bit without any loss in performance. I.e. the effective dimension of the manifold the weights live on can be much, much smaller than the total capacity allows for. In fact, good regularization is exactly that which encourages the model itself to be compressible.
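
(Concretely, torch ships pruning utilities for exactly this kind of experiment; a quick illustration of the API on a dummy layer, not a claim about this particular model:)

  # Magnitude-prune half the weights of a layer. On trained networks many
  # weights sit near zero, which is why this often costs little accuracy;
  # the random layer here is only to show the mechanics.
  import torch
  import torch.nn.utils.prune as prune

  layer = torch.nn.Linear(512, 512)
  prune.l1_unstructured(layer, name="weight", amount=0.5)  # zero smallest 50%
  prune.remove(layer, "weight")                            # bake in the mask
  print((layer.weight == 0).float().mean().item())         # ~0.5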

godelski
0 replies
1h37m

I think you're confusing capacity with the training dynamics.

Capacity is autological. The amount of information it can express.

Training dynamics are the way the model learns, the optimization process, etc. So this is where things like regularization come into play.

There's also architecture which affects the training dynamics as well as model capacity. Which makes no guarantee that you get the most information dense representation.

Fwiw, the authors did also try distillation.

Jensson
1 replies
11h38m

With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.

Neural nets are not guaranteed to converge to anything even remotely optimal, so no that isn't how it works. Also even though neural nets can approximate any function they usually can't do it in a time or space efficient manner, resulting in much larger programs than the human written code.

mlsu
0 replies
14m

Could is certainly a better word, yes. There is no guarantee that it will happen, only that it could. The existence of LLMs is proof of that; imagine how large and inefficient a handwritten computer program to generate the next token would be. On the flip side, human beings very effectively predicting the next token, and much more, on 5 watts is proof that LLMs in their current form certainly are not the most efficient method for generating the next token.

I don't really know why everyone is piling on me here. Sorry for a bit of fun speculating! This model is on the continuum. There is a latent representation of Doom in weights; some weights, not these weights. Therefore some representation of Doom in a neural net could become more efficient over time. That's really the point I'm trying to make.

electrondood
0 replies
13h21m

The Holographic Principle is the idea that our universe is a projection of a higher dimensional space, which sounds an awful lot like the total simulation of an interactive environment, encoded in the parameter space of a neural network.

The first thing I thought when I saw this was: couldn't my immediate experience be exactly the same thing? Including the illusion of a separate main character to whom events are occurring?

troupo
2 replies
11h37m

one can imagine a near future where text prompts are enough to create a fun new game

Sit down and write down a text prompt for a "fun new game". You can start with something relatively simple like a Mario-like platformer.

By page 300, when you're about halfway through describing what you mean, you might understand why this is wishful thinking

reverius42
1 replies
9h52m

If it can be trained on (many) existing games, then it might work similarly to how you don't need to describe every possible detail of a generated image in order to get something that looks like what you're asking for (and looks like a plausible image for the underspecified parts).

troupo
0 replies
8h57m

Things that might work plausible in a static image will not look plausible when things are moving, especially in the game.

Also: https://news.ycombinator.com/item?id=41376722

Also: define "fun" and "new" in a "simple text prompt". Current image generators suck at properly reflecting what you want exactly, because they regurgitate existing things and styles.

SomewhatLikely
1 replies
11h10m

Video games are gonna be wild in the near future. You could have one person talking to a model producing something that's on par with a AAA title from today. Imagine the 2d sidescroller boom on Steam but with immersive photorealistic 3d games with hyper-realistic physics (water flow, fire that spreads, tornados) and full deformability and buildability because the model is pretrained with real world videos. Your game is just a "style" that tweaks some priors on look, settings, and story.

user432678
0 replies
8h15m

Sorry, no offence, but you sound like those EA execs wearing expensive suits and never played a single video game in their entire life. There’s a great documentary on how Half Life was made. Gabe Newell was interviewed by someone asking “why you did that and this, it’s not realistic”, where he answered “because it’s more fun this way, you want realism — just go outside”.

slashdave
0 replies
13h1m

where text prompts are enough to create a fun new game!

Not really. This is a reproduction of the first level of Doom. Nothing original is being created.

radarsat1
0 replies
12h1m

Most games are conditioned on text, it's just that we call it "source code" :).

(Jk of course I know what you mean, but you can seriously see text prompts as compressed forms of programming that leverage the model's prior knowledge)

magicalhippo
0 replies
10h6m

This got me thinking. Anyone tried using SD or similar to create graphics for the old classic text adventure games?

basch
0 replies
13h18m

Similarly, you could run a very very simple game engine, that outputs little more than a low resolution wireframe, and upscale it. Put all of the effort into game mechanics and none into visual quality.

I would expect something in this realm to be a little better at not being visually inconsistent when you look away and look back. A red monster turning into a blue friendly etc.

zzanz
18 replies
14h59m

The quest to run doom on everything continues. Technically speaking, isn't this the greatest possible anti-Doom, the Doom with the highest possible hardware requirement? I just find it funny that on a linear scale of hardware specification, Doom now finds itself on both ends.

fngjdflmdflg
9 replies
14h49m

Technically speaking, isn't this the greatest possible anti-Doom

When I read this part I thought you were going to say because you're technically not running Doom at all. That is, instead of running Doom without Doom's original hardware/software environment (by porting it), you're running Doom without Doom itself.

bugglebeetle
4 replies
14h16m

Pierre Menard, Author of Doom.

jl6
0 replies
4h41m

Knee Deep in the Death of the Author.

el_memorioso
0 replies
13h18m

I applaud your erudition.

airstrike
0 replies
6h13m

OK, this is the single most perfect comment someone could make on this thread. Diffusion me impressed.

1attice
0 replies
12h6m

that took a moment, thank you

ynniv
3 replies
14h0m

It's dreaming Doom.

birracerveza
1 replies
8h58m

We made machines dream of Doom. Insane.

daemin
0 replies
8h27m

Time to make a sheep mod for Doom.

qingcharles
0 replies
12m

Do Robots Dream of E1M1?

Vecr
3 replies
13h31m

It's the No-Doom.

WithinReason
2 replies
10h38m

Undoom?

riwsky
0 replies
10h15m

It’s a mood.

jeffhuys
0 replies
9h50m

Bliss

x-complexity
2 replies
14h12m

Technically speaking, isn't this the greatest possible anti-Doom, the Doom with the highest possible hardware requirement?

Not really? The greatest anti-Doom would be an infinite nest of these types of models predicting models predicting Doom at the very end of the chain.

The next step of anti-Doom would be a model generating the model, generating the Doom output.

yuchi
0 replies
12h34m

“…now it can implement Doom!”

nurettin
0 replies
12h45m

Isn't this technically a model (training step) generating a model (a neural network) generating Doom output?

Terr_
0 replies
12h1m

the Doom with the highest possible hardware requirement?

Isn't that possible by setting arbitrarily high goals for ray-cast rendering?

dtagames
13 replies
5h26m

A diffusion model cannot be a game engine because a game engine can be used to create new games and modify the rules of existing games in real time -- even rules which are not visible on-screen.

These tools are fascinating but, as with all AI hype, they need a disclaimer: The tool didn't create the game. It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).

sharpshadow
5 replies
5h8m

So all it did is generate a video of the gameplay which is slightly different from the video it used for training?

TeMPOraL
4 replies
5h3m

No, it implements a 3D FPS that's interactive, and renders each frame based on your input and a lot of memorized gameplay.

sharpshadow
3 replies
4h38m

But is it playing the actual game or just making an interactive video of it?

Workaccount2
0 replies
4h14m

What is the difference?

TeMPOraL
0 replies
1h50m

Yes.

All video games are, by definition, interactive videos.

What I imagine you're asking about is, a typical game like Doom is effectively a function:

  f(internal state, player input) -> (new frame, new internal state)
where internal state is the shape and looks of loaded map, positions and behaviors and stats of enemies, player, items, etc.

A typical AI that plays Doom, which is not what's happening here, is (at runtime):

  f(last frame) -> new player input
and is attached in a loop to the previous case in the obvious way.

What we have here, however, is a game you can play but implemented in a diffusion model, and it works like this:

  f(player input, N last frames) -> new frame
Of note here is the lack of game state - the state is implicit in the contents of the N previous frames, and is otherwise not represented or mutated explicitly. The diffusion model has seen so much Doom that it, in a way, internalized most of the state and its evolution, so it can look at what's going on and guess what's about to happen. Which is what it does: it renders the next frame by predicting it, based on current user input and last N frames. And then that frame becomes the input for the next prediction, and so on, and so on.

So yes, it's totally an interactive video and a game and a third thing - a probabilistic emulation of Doom on a generative ML model.

Maxatar
0 replies
4h21m

Making an interactive video of it. It is not playing the game, a human does that.

With that said, I wholly disagree that this is not an engine. This is absolutely a game engine and while this particular demo uses the engine to recreate DOOM, an existing game, you could certainly use this engine to produce new games in addition to extrapolating existing games in novel ways.

kqr
4 replies
5h17m

even rules which are not visible on-screen.

If a rule was changed but it's never visible on the screen, did it really change?

It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).

Simply?! I understand it's mechanically trivial but the fact that it's compressed such a rich conditional distribution seems far from simple to me.

darby_nine
2 replies
5h11m

Simply?! I understand it's mechanically trivial but the fact that it's compressed such a rich conditional distribution seems far from simple to me.

It's much simpler than actually creating a game....

stnmtn
1 replies
3h5m

If someone told you 10 years ago that they were going to create something where you could play a whole new level of Doom, without them writing a single line of game logic/rendering code, would you say that that is simpler than creating a demo by writing the game themselves?

darby_nine
0 replies
21m

There are two things at play here: the complexity of the underlying mechanism, and the complexity of detailed creation. This is obviously a complicated mechanism, but in another sense it's a trivial result compared to actually reproducing the game itself in its original intended state.

znx_0
0 replies
5h4m

If a rule was changed but it's never visible on the screen, did it really change?

Well for "some" games it does really change

momojo
0 replies
1h35m

The title should be "Diffusion Models can be used to render frames given user input"

calebh
0 replies
3h35m

One thing I'd like to see is to take a game rendered with low poly assets (or segmented in some way) and use a diffusion model to add realistic or stylized art details. This would fix the consistency problem while still providing tangible benefits.

godelski
11 replies
9h48m

Doom system requirements:

  - 4 MB RAM
  - 12 MB disk space 
Stable diffusion v1

  > 860M UNet and CLIP ViT-L/14 (540M)
  Checkpoint size:
    4.27 GB
    7.7 GB (full EMA)
  Running on a TPU-v5e
    Peak compute per chip (bf16)  197 TFLOPs
    Peak compute per chip (Int8)  393 TFLOPs
    HBM2 capacity and bandwidth  16 GB, 819 GBps
    Interchip Interconnect BW  1600 Gbps
This is quite impressive, especially considering the speed. But there's still a ton of room for improvement. It seems it didn't even memorize the game despite having the capacity to do so hundreds of times over. So we definitely have lots of room for optimization methods. Though who knows how such things would affect existing tech since the goal here is to memorize.
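(For a rough sense of scale: the 4.27 GB checkpoint is about 4,270 MB / 12 MB ≈ 350x Doom's entire on-disk footprint, so memorizing every asset a few hundred times over is comfortably within its capacity.)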

What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (how much prior knowledge do they get, considering pretrained models and the ViZDoom environment? Was the Doom source code in T5? And which ViT checkpoint was used? I can't keep track of Google ViT checkpoints).

I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.

- https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...

- https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...

- https://cloud.google.com/tpu/docs/v5e

- https://github.com/Farama-Foundation/ViZDoom

- https://zdoom.org/index

snickmy
6 replies
9h39m

Those are valid points, but irrelevant for the context of this research.

Yes, the computational cost is ridiculous compared to the original game, and yes, it lacks basic things like pre-computing, storing, etc. That said, you could assume that all of that can either be done at the margins of this discovery, or will naturally improve over time, or will become less important as a blocker.

The fact that you can model a sequence of frames with such contextual awareness, without explicitly having to encode it, is the real breakthrough here. Both from a pure gaming standpoint, and for simulation in general.

godelski
2 replies
9h22m

I'm not sure what you're saying is irrelevant.

1) the model has enough memory to store not only all game assets and engine but even hundreds of "plays".

2) me mentioning that there's still a lot of room to make these things better (seems you think so too so maybe not this one?)

3) an interesting point I was wondering about, to compare the current state of things (I mean, I'll give you this one, but it's just a random thought and I'm not reviewing this paper in an academic setting. This is HN, not NeurIPS. I'm just curious ¯\_(ツ)_/¯)

4) the point that you can rip a game

I'm really not sure what you're contesting to because I said several things.

  > it lacks basic things like pre-computing, storing, etc.
It does? Last I checked neural nets store information. I guess I need to return my PhD because last I checked there's a UNet in SD 1.4 and that contains a decoder.

snickmy
1 replies
8h22m

Sorry, probably didn't explain myself well enough

1) Yes, you are correct. The point I was making is that, in the context of the discovery/research, that's outside the scope, and 'easier' to do, as it has been done in other verticals (e.g. e2e self-driving)

2) yep, aligned here

3) I'm not fully following here, but agree this is not NeurIPS, and no Schmidhuber's bickering.

4) The network does store information, it just doesn't store explicit gameplay state. That could be forced, but as per point 1 it is, I think rightly, beyond the scope of this research

godelski
0 replies
1h18m

1) I'm not sure this is outside the scope. It's also not something I'd use to reject a paper were I to review this in a conference. I mean, you've got to start somewhere, and unlike reviewer 2 I don't think every criticism is grounds for rejection. That'd be silly given the lack of globally optimal solutions. But I'm also unconvinced this has been proven by self-driving vehicles; then again, I'm not an RL expert.

3) It's always hard to evaluate. I was thinking about ripping the game, so a reasonable metric is comparing against a human's ability to perform the same task. Of course I'm A LOT faster than my dishwasher at cleaning dishes, but I'm not occupied while it is going, so it still has high utility. (Someone tell reviewer 2 lol)

4) Why should we believe that it doesn't store gameplay? The model was fed "user" inputs and frames. So it has this information and this information appears useful for learning the task.

tobr
1 replies
9h18m

I suppose it also doesn't really matter what kinds of resources the game originally requires. The diffusion model isn't going to require twice as much memory just because the game does. Presumably you wouldn't even necessarily need to be able to render the original game in real time - I would imagine the basic technique would work even if you used a state-of-the-art Hollywood-quality offline renderer to render each input frame, and that the performance of the diffusion model would be similar?

godelski
0 replies
49m

Well, the majority of ML systems are compression machines (entropy minimizers), so ideally you'd want to see if you can learn the assets and game mechanics through play alone (what this paper shows). Better would be to do so more efficiently than the devs themselves, finding better compression. Certainly the game is not perfectly optimized. But still, this is a step in that direction. I mean, no one has accomplished this before, so even with a model with far higher capacity it's progress. (I think people are interpreting my comment as dismissive. I'm critiquing, but the key point I was making was that there are likely better architectures, training methods, and all sorts of things still to research. Personally I'm glad there's still more to research. That's the fun part)

pickledoyster
0 replies
9h19m

you could assume that all that can be either done at the margin of this discovery OR over time will naturally improve OR will become less important as a blocker.

OR one can hope it will be thrown to the heap of nonviable tech with the rest of spam waste

dTal
3 replies
8h13m

What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute

That's the least of it. It means you can generate a game from real footage. Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.

phh
0 replies
3h8m

Want a perfect flight sim? Put a GoPro in the cockpit of every airliner for a year.

I guess that's the occasion to remind that ML is splendid at interpolating, but extrapolating, maybe don't keep your hopes too high.

Namely, to have a "perfect flight sim" using GoPros, you'll need to record hundreds of stalls and crashes.

isaacfung
0 replies
7h5m

The possibility seems far beyond gaming(given enough computation resources).

You can feed it videos of the usage of any software, or real-world footage recorded by a GoPro mounted on your shoulder (with body motion measured by some sensors, though the action space would be much larger).

Such a "game engine" can potentially be used as a simulation gym environment to train RL agents.

camtarn
0 replies
6h15m

Plus, presumably, either training it on pilot inputs (and being able to map those to joystick inputs and mouse clicks) or having the user have an identical fake cockpit to play in and a camera to pick up their movements.

And, unless you wanted a simulator that only allowed perfectly normal flight, you'd have to have those airliners go through every possible situation that you wanted to reproduce: warnings, malfunctions, emergencies, pilots pushing the airliner out of its normal flight envelope, etc.

SeanAnderson
10 replies
2h36m

After some discussion in this thread, I found it worth pointing out that this paper is NOT describing a system which receives real-time user input and adjusts its output accordingly, but, to me, the way the abstract is worded heavily implied this was occurring.

It's trained on a large set of data in which agents played DOOM and video samples are given to users for evaluation, but users are not feeding inputs into the simulation in real-time in such a way as to be "playing DOOM" at ~20FPS.

There are some key phrases within the paper that hint at this such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's just the omission of a section describing real-time user gameplay.

refibrillator
2 replies
2h12m

You are incorrect, this is an interactive simulation that is playable by humans.

Figure 1: a human player is playing DOOM on GameNGen at 20 FPS.

The abstract is ambiguously worded which has caused a lot of confusion here, but the paper is unmistakably clear about this point.

Kind of disappointing to see this misinformation upvoted so highly on a forum full of tech experts.

FrustratedMonky
1 replies
1h57m

Yeah. If it isn't doing this, then what could it be doing that is worth a paper? "real-time user input and adjusts its output accordingly"

rvnx
0 replies
1h47m

There is a hint in the paper itself:

It says, somewhat quietly, that it is based on: "Ha & Schmidhuber (2018) who train a Variational Auto-Encoder (Kingma & Welling, 2014) to encode game frames into a latent vector"

So it means they most likely took https://worldmodels.github.io/ (which is actually open-source) or something similar and swapped the frame generation for Stable Diffusion, which was released in 2022.

Chance-Device
1 replies
1h39m

I also thought this, but refer back to the paper, not the abstract:

A is the set of key presses and mouse movements…

…to condition on actions, we simply learn an embedding A_emb for each action

So, it's clear that in this model the diffusion process is conditioned on an embedding A_emb derived from user actions rather than from words.

Then a noised start frame is encoded into latents and concatenated onto the noise latents as a second conditioning.

So we have a diffusion model which is trained solely on images of doom, and which is conditioned on current doom frames and user actions to produce subsequent frames.

So yes, the users are playing it.

However, it should be unsurprising that this is possible. This is effectively just a neural recording of the game. But it’s a cool tech demo.
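To make that even more concrete, the conditioning presumably boils down to something like this (my own PyTorch-flavored sketch; `unet` and `encode` are placeholders, not the authors' code):

  import torch
  import torch.nn as nn

  NUM_ACTIONS, EMB_DIM = 32, 768
  action_emb = nn.Embedding(NUM_ACTIONS, EMB_DIM)    # the learned A_emb per action

  def denoise_step(unet, encode, past_frames, actions, noisy_latent, t):
      ctx = encode(past_frames)                      # encode recent frames into latents
      x = torch.cat([noisy_latent, ctx], dim=1)      # concatenate along the channel axis
      cond = action_emb(actions)                     # action embeddings stand in for the text prompt
      return unet(x, t, cond)                        # predict the next frame's denoised latent

So each denoising step sees the noised target, the encoded recent frames, and the embedded key presses, instead of a CLIP text embedding.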

foota
0 replies
5m

I wonder if they could somehow feed in a trained Gaussian splats model to this to get better images?

Since the splats are specifically designed for rendering it seems like it would be an efficient way for the image model to learn the geometry without having to encode it on the image model itself.

teamonkey
0 replies
1h34m

I think someone is playing it, but it has a reduced set of inputs and they're playing it in a very specific way (slowly, avoiding looking back to places they've been) so as not to show off the flaws in the system.

The people surveyed in this study are not playing the game, they are watching extremely short video clips of the game being played and comparing them to equally short videos of the original Doom being played, to see if they can spot the difference.

I may be wrong with how it works, but I think this is just hallucinating in real time. It has no internal state per se, it knows what was on screen in the previous few frames and it knows what inputs the user is pressing, and so it generates the next frame. Like with video compression, it probably doesn't need to generate a full frame every time, just "differences".

As with all the previous AI game research, these are not games in any real sense. They fall apart when played beyond any meaningful length of time (seconds). Crucially, they are not playable by anyone other than the developers in very controlled settings. A defining attribute of any game is that it can be played.

pajeets
0 replies
1h48m

I knew it was too good to be true, but it seems like real-time video generation can get good enough that it feels like a truly interactive video/game.

Imagine if text2game were possible: some sort of network generating each frame from an image generated by text, with some underlying 3D physics simulation to keep all the multiplayer screens sync'd.

This paper doesn't seem to be that, though; rather, it's worded cleverly enough to make you think people were playing a real-time video. We can't even generate more than 5-10 seconds of video without it hallucinating. Something this persistent would require an extreme amount of gameplay video training. It can be done, but the video shown by this paper is not true to its words.

lewhoo
0 replies
1h15m

The movement of the player seems a bit jittery, so I inferred something similar on that basis.

bob1029
0 replies
2h16m

Were the agents playing at 20 real FPS, or did this occur like a Pixar movie offline?

panki27
9 replies
9h53m

Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation.

I can hardly believe this claim, anyone who has played some amount of DOOM before should notice the viewport and textures not "feeling right", or the usually static objects moving slightly.

meheleventyone
5 replies
9h20m

It's telling IMO that they only want people's opinions, based on our notoriously faulty memories, rather than sitting comparable situations from the game and the simulation next to one another and analyzing them. Several things jump out watching the example video.

GaggiX
4 replies
6h39m

rather than sitting comparable situations next to one another in the game and simulation then analyzing them.

That's literally how the human rating was set up, if you read the paper.

meheleventyone
3 replies
4h55m

I think you misunderstand me. I don't mean a snap evaluation and deciding between two very-short competing videos which is what the participants were doing. I mean doing an actual analysis of how well the simulation matches the ground truth of the game.

What I'd posit is that it's not actually a very good replication of the game, but very good at replicating short clips that almost look like the game, and that the short time horizons are deliberately chosen because the authors know the model lacks coherence beyond that.

GaggiX
2 replies
4h20m

I mean doing an actual analysis of how well the simulation matches the ground truth of the game.

Do you mean the PSNR and LPIPS metrics used in paper?

meheleventyone
1 replies
3h18m

No, I think I've been pretty clear that I'm interested in how mechanically sound the simulation is. Also those measures are over an even shorter duration so even less relevant to how coherent it is at real game scales.

GaggiX
0 replies
3h1m

How should this be concretely evaluated and measured? A vibe check?

freestyle24147
0 replies
5h8m

It made me laugh. Maybe they pulled random people from the hallway who had never seen the original Doom (or any FPS), or maybe only selected people who wore glasses and forgot them at their desk.

arc-in-space
0 replies
7h5m

This, watching the generated clips feels uncomfortable, like a nightmare. Geometry is "swimming" with camera movement, objects randomly appear and disappear, damage is inconsistent.

The entire thing would probably crash and burn if you did something just slightly unusual compared to the training data, too. People talking about 'generated' games often seem to fantasize about an AI that will make up new outcomes for players that go off the beaten path, but a large part of the fun of real games is figuring out what you can do within the predetermined constraints set by the game's code. (Pen-and-paper RPGs are highly open-ended, but even a Game Master sometimes needs to protect the players from themselves; whereas the current generation of AI is famously incapable of saying no.)

aithrowaway1987
0 replies
6h12m

I also noticed that they played AI DOOM very slowly: in an actual game you are running around like a madman, but in the video clips the player is moving in a very careful, halting manner. In particular the player only moves in straight lines or turns while stationary, they almost never turn while running. Also didn't see much strafing.

I suspect there is a reason for this: running while turning doesn't work properly and makes it very obvious that the system doesn't have a consistent internal 3D view of the world. I'm already getting motion sickness from the inconsistencies in straight-line movement, I can't imagine turning is any better.

Sohcahtoa82
7 replies
1h44m

It's always fun reading the dead comments on a post like this. People love to point out how pointless this is.

Some of ya'll need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.

Not everything has to be made for profit. Not everything has to be made to make the world a better place. Sometimes, people create things just for the learning experience, the challenge, or they're curious to see if something is possible.

Time spent enjoying yourself is never time wasted. Some of ya'll are going to be on your death beds wishing you had allowed yourself to have more fun.

ninetyninenine
4 replies
1h37m

I don’t think this is not useful. This is a stepping stone for generating entire novel games.

Sohcahtoa82
3 replies
1h17m

This is a stepping stone for generating entire novel games.

I don't see how.

This game "engine" is purely mapping [pixels, input] -> new pixels. It has no notion of game state (so you can kill an enemy, turn your back, then turn around again, and the enemy could be alive again), not to mention that it requires the game to already exist in order to train it.

I suppose, in theory, you could train the network to include game state in the input and output, or potentially even handle game state outside the network entirely and just make it one of the inputs, but the output would be incredibly noisy and nigh unplayable.

And like I said, all of it requires the game to already exist in order to train the network.

ninetyninenine
1 replies
1h2m

It has no notion of game state (so you can kill an enemy, turn your back, then turn around again)

Well, you see a wall, you turn around, then turn back, and the wall is still there. With enough training data the model will be able to pick up the state of the enemy, because it has ALREADY learned the state of the wall thanks to much more numerous data on the wall. It's probably impractical to do this, but this is only a stepping stone, like I said.

not to mention that it requires the game to already exist in order to train it.

Is this a problem? Do games not exist? Not only do we have tons of games, but we also have, in theory, unlimited amounts of training data for each game.

Sohcahtoa82
0 replies
43m

Well you see a wall you turn around then turn back the wall is still there. With enough training data the model will be able to pick up the state of the enemy because it has ALREADY learned the state of the wall due to much more numerous data on the wall.

It's really important to understand that ALL THE MODEL KNOWS is a mapping of [pixels, input] -> new pixels. It has zero knowledge of game state. The wall is still there after spinning 360 degrees simply because it knows that the image of a view facing away from the wall while holding the key to turn right eventually becomes an image of a view of the wall.

The only "state" that is known is the last few frames of the game screen. Because of this, it's simply not possible for the game model to know if an enemy should be shown as dead or alive once it has been off-screen for longer than those few frames. It also means that if you keeping turning away and towards an enemy, it could teleport around. Once it's off the screen for those few frames, the model will have forgotten about it.

Is this a problem? Do games not exist?

If you're trying to make a new game, then you need new frames to train the model on.

airstrike
0 replies
1h9m

> (so you can kill an enemy, turn your back, then turn around again, and the enemy could be alive again)

Sounds like a great game.

> not to mention that it requires the game to already exist in order to train it

Diffusion models create new images that did not previously exist all of the time, so I'm not sure how that follows. It's not hard to extrapolate from TFA to a model that generically creates games based on some input

msk-lywenn
0 replies
1h36m

I'd like to know the carbon footprint of that fun.

Gooblebrai
0 replies
1h40m

So true. Hustle culture is a spreading disease that has replaced the fun-maker culture of the 80s/90s.

It's unavoidable though. The cost of living getting increasingly expensive, and the romanticization of entrepreneurs as if they were rock stars, lead towards this hustle mindset.

darrinm
5 replies
14h11m

So… is it interactive? Playable? Or just generating a video of gameplay?

vunderba
4 replies
14h9m

From the article: We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.

The demo is actual gameplay at ~20 FPS.

darrinm
3 replies
13h57m

It confused me that their stated evaluations by humans are comparing video clips rather than evaluating game play.

furyofantares
2 replies
13h14m

Short clips are the only way a human will make any errors determining which is which.

darrinm
1 replies
13h1m

More relevant is if by _playing_ it they couldn’t tell which is which.

Jensson
0 replies
11h30m

They obviously can within seconds, so it wouldn't be a result. Being able to generate gameplay that looks right even if it doesn't play right is one step.

arduinomancer
5 replies
13h18m

How does the model “remember” the whole state of the world?

Like if I kill an enemy in some room and walk all the way across the map and come back, would the body still be there?

a_e_k
1 replies
13h15m

Watch closely in the videos and you'll see that enemies often respawn when offscreen and sometimes when onscreen. Destroyed barrels come back, ammo count and health fluctuate weirdly, etc. It's still impressive, but it's not perfect in that regard.

Sharlin
0 replies
13h12m

Not unlike in (human) dreams.

Jensson
1 replies
11h35m

It doesn't even remember the state of the game you look at. Doors spawning right in front of you, particle effects turning into enemies mid flight etc, so just regular gen AI issues.

Edit: Can see this in the first 10 seconds of the first video under "Full Gameplay Videos", stairs turning to corridor turning to closed door for no reason without looking away.

csmattryder
0 replies
8h57m

There's also the case in the video (0:59) where the player jumps into the poison but doesn't take damage for a few seconds then takes two doses back-to-back - they should've taken a hit of damage every ~500-1000ms(?)

Guessing the model hasn't been taught enough about that, because most people don't jump into hazards.

raincole
0 replies
12h40m

It doesn't. You need to put the world state in the input (the "prompt", even it doesn't look like prompt in this case). Whatever not in the prompt is lost.

sitkack
4 replies
12h59m

What most programmers don't understand is that in the very near future, the entire application will be delivered by an AI model: no source, no text, just connect to the app over RDP. The whole app will be created by example; the app developer will train the app like a dog trainer trains a dog.

ukuina
1 replies
11h35m

So... https://websim.ai except over pixels instead of in your browser?

sitkack
0 replies
2h24m

Yes, and that is super neat.

Grimblewald
1 replies
12h23m

That might work for some applications, especially recreational ones, but I think we're a while away from it doing away with everything, especially where deterministic behavior, efficiency, or reliability are important.

sitkack
0 replies
45m

Problems for two papers down the line.

masterspy7
4 replies
14h42m

There's been a ton of work to generate assets for games using AI: 3d models, textures, code, etc. None of that may even be necessary with a generative game engine like this! If you could scale this up, train on all games in existence, etc. I bet some interesting things would happen

rererereferred
2 replies
14h12m

But can you grab what this AI has learned and generate the 3D models, maps, and code to turn it into an actual game that can run on a user's PC? That would be amazing.

passion__desire
0 replies
12h53m

Jensen Huang's vision that future games will be generated rather than rendered is coming true.

kleiba
0 replies
11h56m

What would be the point? This model has been trained on an existing game, so turning it back into assets, maps, and code would just give you a copy of the original game you started with. I suppose you could create variations of it then... but:

You don't even need to do all of that - this trained model already is the game, i.e., it's interactive, you can play the game.

whamlastxmas
0 replies
13h58m

I would absolutely love if they could take this demo, add a new door that isn’t in the original, and see what it generates behind that door

golol
4 replies
10h21m

What I understand is the following: if this works so well, why didn't we have good video generation much earlier? After diffusion models were seen to work, the most obvious thing to do was to generate the next frame based on previous frames, but... it took 1-2 years for good video models to appear. For example, compare Sora generating Minecraft video versus this method generating Minecraft video. Say in both cases the player is standing on a meadow with few inputs and watching some pigs. In the Sora video you'd expect the typical glitches to appear: erratic, sliding movement, overlapping legs, multiplication of pigs, etc. Would these glitches not appear in the GameNGen video? Why?

Closi
2 replies
10h18m

Because video is much more difficult than images (it's lots of images that have to be consistent across time, with motion following laws of physics etc), and this is much more limited in terms of scope than pure arbitrary video generation.

golol
1 replies
9h7m

This misses the point, I'm comparing two methods of generating minecraft videos.

soulofmischief
0 replies
8h57m

By simplifying the problem, we are better able to focus on researching specific aspects of generation. In this case, they synthetically created a large, highly domain-specific training set and then used this to train a diffusion model which encodes input parameters instead of text.

Sora was trained on a much more diverse dataset, and so has to learn more general solutions in order to maintain consistency, which is harder. The low resolution and simple, highly repetitive textures of doom definitely help as well.

In general, this is just an easier problem to approach because of the more focused constraints. It's also worth mentioning that noise was added during the process in order to make the model robust to small perturbations.

pantalaimon
0 replies
10h18m

I would have thought it is much easier to generate huge amounts of game footage for training, but as I understand this is not what was done here.

mo_42
3 replies
12h37m

An implementation of the game engine in the model itself is theoretically the most accurate solution for predicting the next frame.

I'm wondering when people will apply this to other areas like the real world. Would it learn the game engine of the universe (ie physics)?

cubefox
1 replies
11h34m

A popular theory in neuroscience is that this is what the brain does:

https://slatestarcodex.com/2017/09/05/book-review-surfing-un...

It's called predictive coding. By trying to predict sensory stimuli, the brain creates a simplified model of the world, including common sense physics. Yann LeCun says that this is a major key to AGI. Another one is effective planning.

But while current predictive models (autoregressive LLMs) work well on text, they don't work well on video data, because of the large outcome space. In an LLM, text prediction boils down to a probability distribution over a few thousand possible next tokens, while there are several orders of magnitude more possible "next frames" in a video. Diffusion models work better on video data, but they are not inherently predictive like causal LLMs. Apparently this new Doom model made some progress on that front though.
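(For a rough sense of that gap: a text LLM picks from a vocabulary of ~50k tokens per step, while even a tiny 64x64 RGB frame has 256^(64*64*3) ≈ 10^29,600 possible values.)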

ccozan
0 replies
7h34m

However, this is due to how we actually digitize video. From a human point of view, looking at my room reduces the load to the _objects_ in the room, and everything else is just noise (e.g. the color of the wall could be just a single item to remember, while in the digital world it needs to remember all the pixels).

radarsat1
0 replies
12h4m

There has definitely been research for simulating physics based on observation, especially in fluid dynamics but also for rigid body motion and collision. It's important for robotics applications actually. You can bet people will be applying this technique in those contexts.

I think for real world application one challenge is going to be the "action" signal which is a necessary component of the conditioning signal that makes the simulation reactive. In video games you can just record the buttons, but for real world scenarios you need difficult and intrusive sensor setups for recording force signals.

(Again for robotics though maybe it's enough to record the motor commands, just that you can't easily record the "motor commands" for humans, for example)

lIl-IIIl
3 replies
10h25m

How does it know how many times it needs to shoot the zombie before it dies?

Most enemies have enough hit points to survive the first shot. If the model is only trained on the previous frame, it doesn't know how many times the enemy was already shot at.

From the video it seems like it is probability based - they may die right away or it might take way longer than it should.

I love how the player's health goes down when he stands in the radioactive green water.

In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.

meheleventyone
0 replies
9h19m

I love how the player's health goes down when he stands in the radioactive green water.

This is one of the bits that was weird to me, it doesn't work correctly. In the real game you take damage at a consistent rate; in the video the player doesn't, and whether the player takes damage or not seems highly dependent on some factor that isn't whether or not the player is in the radioactive slime. My thought is that it's learnt something else that correlates poorly.

lupusreal
0 replies
8h57m

In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.

They trained this thing on bot gameplay, so I bet it does poorly when advanced strategies like deliberately inducing mob infighting are employed (the bots probably didn't do that a lot, if at all).

golol
0 replies
10h18m

It gets a number of previous frames as input I think.

HellDunkel
3 replies
9h56m

Although impressive, I must disagree. Diffusion models are not game engines. A game engine is a component that propels your game (along the time axis?). In that sense it is similar to the engine of a car, hence the name. It does not need a single working car nor a road to drive on to do its job. The above is a dynamic, interactive replication of what happens when you put a car on a given road, requiring a million test drives with working vehicles. An engine would also work offroad.

MasterScrat
2 replies
9h22m

Interesting point.

In a way this is a "simulated game engine", trained from actual game engine data. But I would argue a working simulated game engine becomes a game engine of its own, as it is then able to "propell the game" as you say. The way it achieves this becomes irrelevant, in one case the content was crafted by humans, in the other case it mimics existing game content, the player really doesn't care!

An engine would also work offroad.

Here you could imagine that such a "generative game engine" could also go offroad, extrapolating what would happen if you go to unseen places. I'd even say extrapolation capabilities of such a model could be better than a traditional game engine, as it can make things up as it goes, while if you accidentally cross a wall in a typical game engine the screen goes blank.

jsheard
0 replies
6h42m

Here you could imagine that such a "generative game engine" could also go offroad, extrapolating what would happen if you go to unseen places.

They easily could have demonstrated this by seeding the model with images of Doom maps which weren't in the training set, but they chose not to. I'm sure they tried it and the results just weren't good, probably morphing the map into one of the ones it was trained on at the first opportunity.

HellDunkel
0 replies
6h24m

The game Doom is more than a game engine, isn't it? I'd be okay with calling the above a "simulated game" or a "game". My point is: let's not conflate that with the idea of a "game engine", which is a construct of intellectual concepts put together to create a simulation of "things happening in time" and derive output (audio and visual). The engine is fed with input and data (levels and other assets) and then drives a "game".

Training the model on a finished game will never give you an engine. Maybe a "simulated game" or even a "game", but certainly not an "engine". The latter would mean the model is capable of deriving and extracting the technical and intellectual concepts and applying them elsewhere.

smusamashah
2 replies
7h53m

Has this model actually learned the 3d space of the game? Is it possible to break the camera free and roam around the map freely and view it from different angles?

I noticed a few hallucinations, e.g. when it picked up the green jacket from a corner, then walking back it generated another corner. Therefore I don't think it has any clue about the 3D world of the game at all.

kqr
0 replies
7h35m

Is it possible to break the camera free and roam around the map freely and view it from different angles?

I would assume only if the training data contained this type of imagery, which it did not. The training data (from what I understand) consisted only of input+video of actual gameplay, so that is what the model is trained to mimic.

This is like a dog that has been trained to form English words – what's impressive is not that it does it well, but that it does it at all.

Sohcahtoa82
0 replies
1h56m

Therefore I don't think it has any clue about the 3D world of the game at all.

AI models don't "know" things at all.

At best, they're just very fuzzy predictors. In this case, given the last couple frames of video and a user input, it predicts the next frame.

It has zero knowledge of the game world, game rules, interactions, etc. It's merely a mapping of [pixels, input] -> pixels.

qnleigh
2 replies
10h52m

Could a similar scheme be used to drastically improve the visual quality of a video game? You would train the model on gameplay rendered at low and high quality (say with and without ray tracing, and with low and high density meshing), and try to get it to convert a quick render into something photorealistic on the fly.

When things like DALL-E first came out, I was expecting something like the above to make it into mainstream games within a few years. But that was either too optimistic or I'm not up to speed on this sort of thing.
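The single-frame, offline version of this is already easy to try with off-the-shelf tools; the hard part is doing it at interactive frame rates with temporal consistency. A rough sketch using the diffusers img2img pipeline (model choice, prompt, and strength are just example values):

  import torch
  from PIL import Image
  from diffusers import StableDiffusionImg2ImgPipeline

  pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
      "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
  ).to("cuda")

  frame = Image.open("lowpoly_frame.png").convert("RGB")   # a cheap game render
  out = pipe(
      prompt="photorealistic sci-fi corridor, volumetric lighting",
      image=frame,
      strength=0.35,        # low strength keeps the geometry, restyles the surfaces
      guidance_scale=7.5,
  ).images[0]
  out.save("restyled_frame.png")

Doing that per frame without flicker, and in a few milliseconds rather than a second, is the part that still seems out of reach.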

agys
1 replies
10h48m

Isn't that what Nvidia’s Ray Reconstruction and DLSS (frame generation and upscaler) are doing, more or less?

qnleigh
0 replies
9h49m

At a high level I guess so. I don't know enough about Ray Reconstruction (though the results are impressive), but I was thinking of something more drastic than DLSS. Diffusion models on static images can turn a cartoon into a photorealistic image. Doing something similar for a game, where a low-quality render is turned into something that would otherwise take seconds to render, seems qualitatively quite different from DLSS. In principle a model could fill in huge amounts of detail, like increasing the number of particles in a particle-based effect, adding shading/lighting effects...

icoder
2 replies
9h53m

This is impressive. But at the same time, it can't count. We see this every time, and I understand why it happens, but it is still intriguing. We are so close, or in some ways even way beyond, and yet at the same time so extremely far away from 'our' intelligence.

(I say it can't count because there are numerous examples where the bullet count glitches. It goes right impressively often, but still, counting, whether up or down, is something computers have been able to do flawlessly basically since forever)

(It is the same with chess, where the LLM models are becoming really good, yet sometimes make mistakes that even my 8yo niece would not make)

marci
1 replies
9h30m

'Our' intelligence may not be the best thing we can make. It would be like trying to only make planes that flap their wings or trucks with legs. A bit like using an LLM to do multiplication: not the best tool. Biomimicry is great for inspiration, but shouldn't be a 1-to-1 copy, especially at a different scale and in a different medium.

icoder
0 replies
8h30m

Sure, although I still think a system with less of a contrast between how well it performs 'modally' and how badly it performs incidentally would be more practical.

What I wonder is whether LLMs will inherently always have this dichotomy and we need something 'extra' (reasoning, attention, or something less biomimicked), or whether this will eventually resolve itself (to an acceptable extent) when they improve even further.

EcommerceFlow
2 replies
5h11m

Jensen said that this is the future of gaming a few months ago fyi.

weakfish
0 replies
5h6m

Who is that?

Fraterkes
0 replies
4h52m

Thousands of different people have been speculating about this kind of thing for years.

richard___
1 replies
12h13m

Uhhh… demos would be more convincing with enemies and decreasing health

Kiro
0 replies
9h16m

I see enemies and decreasing health on hit. But even if it lacked those, it seems like a pretty irrelevant nitpick that is completely underplaying what we're seeing here. The fact that this is even possible at all feels like science fiction.

nolist_policy
1 replies
11h15m

Makes me wonder... If you stand still in front of a door so all past observations only contain that door, will the model teleport you to another level when opening the door?

zbendefy
0 replies
10h32m

I think some state is also being given (or if it's not, it could be given) to the network, like the 3D world position/orientation of the player, that could help the neural network anchor the player in the world.

jetrink
1 replies
4h34m

What if instead of a video game, this was trained on video and control inputs from people operating equipment like warehouse robots? Then an automated system could visualize the result of a proposed action or series of actions when operating the equipment itself. You would need a different model/algorithm to propose control inputs, but this would offer a way for the system to validate and refine plans as part of a problem solving feedback loop.
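That's roughly the "world model as imagination" idea. A sketch of the validation loop (everything here is hypothetical: `world_model` would be the learned frame predictor, `score` whatever task metric you care about):

  N_CONTEXT = 8  # how many past frames the predictor sees

  def imagine(world_model, history, plan):
      # roll a candidate action sequence forward entirely in the model's "imagination"
      frames = list(history)
      for action in plan:
          frames.append(world_model(frames[-N_CONTEXT:], action))
      return frames

  def pick_plan(world_model, history, candidate_plans, score):
      # keep the plan whose imagined outcome scores best, then execute it for real
      return max(candidate_plans,
                 key=lambda plan: score(imagine(world_model, history, plan)))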

Workaccount2
0 replies
4h9m

Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control

https://deepmind.google/discover/blog/rt-2-new-model-transla...

jamilton
1 replies
11h0m

I wonder if the MineRL (https://www.ijcai.org/proceedings/2019/0339.pdf and minerl.io) dataset would be sufficient to reproduce this work with Minecraft.

Any other similar existing datasets?

A really goofy way I can think of to get a bunch of data would be to get videos from youtube and try to detect keyboard sounds to determine what keys they're pressing.

jamilton
0 replies
9h21m

Although ideally a follow up work would be something where there won’t be any potential legal trouble with releasing the complete model so people can play it.

A similar approach but with a game where the exact input is obvious and unambiguous from the graphics alone so that you can use unannotated data might work. You’d just have to create a model to create the action annotations. I’m not sure what the point would be, but it sounds like it’d be interesting.

helloplanets
1 replies
12h26m

So, any given sequence of inputs is rebuilt into a corresponding image, twenty times per second. I wonder how separate the game logic and the generated graphics are in the fully trained model.

Given a sufficient separation between these two, couldn't you basically boil the game/input logic down to an abstract game template? Meaning, you could just output a hash that corresponds to a specific combination of inputs, and then treat the resulting mapping as a representation of a specific game's inner workings.

To make it less abstract, you could save some small enough snapshot of the game engine's state for all given input sequences. This could make it much less dependent to what's recorded off of the agents' screens. And you could map the objects that appear in the saved states to graphics, in a separate step.

I imagine this whole system would work especially well for games that only update when player input is given: Games like Myst, Sokoban, etc.
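To make that last idea concrete, a toy version of the mapping for input-driven games (everything here is made up for illustration, nothing from the paper):

  from hashlib import sha256

  def key(inputs):
      # e.g. inputs = ["FORWARD", "FORWARD", "USE"]
      return sha256("|".join(inputs).encode()).hexdigest()

  snapshots = {}   # input-sequence hash -> compact engine snapshot
  snapshots[key(["FORWARD", "FORWARD", "USE"])] = {"pos": (12, 7), "door_open": True}

  # rendering becomes a separate step that maps whatever is in the snapshot to graphics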

toppy
0 replies
11h43m

I think you've just encoded the title of the paper

harha_
1 replies
3h58m

This is so sick I don't know what to say. I never expected this, aren't the implications of this huge?

aithrowaway1987
0 replies
3h35m

I am struggling to understand a single implication of this! How does this generalize to anything other than playing retro games in the most expensive way possible? The very intention of this project is overfitting to data in a non-generalizable way! Maybe it's just pure engineering, that good ANNs are getting cheap and fast. But this project still seems to have the fundamental weaknesses of all AI projects:

- needs a huge amount of data, which a priori precludes a lot of interesting use cases

- flashy-but-misleading demos which hide the actual weaknesses of the AI software (note that the player is moving very haltingly compared to a real game of DOOM, where you almost never stop moving)

- AI nailing something really complicated for humans (98% effective raycasting, 98% effective Python codegen) while failing to grasp abstract concepts rigorously understood by fish (object permanence, quantity)

I am genuinely struggling to see this as a meaningful step forward. It seems more like a World's Fair exhibit - a fun and impressive diversion, but probably not a vision of the future. Putting it another way: unlike AlphaGo, Deep Blue wasn't really a technological milestone so much as a sociological milestone reflecting the apex of a certain approach to AI. I think this DOOM project is in a similar vein.

gwbas1c
1 replies
5h48m

Am I the only one who thinks this is faked?

It's not that hard to fake something like this: Just make a video of DOSBox with DOOM running inside of it, and then compress it with settings that will result in compression artifacts.

GaggiX
0 replies
5h40m

Am I the only one who thinks this is faked?

Yes.

dysoco
1 replies
13h3m

Ah finally we are starting to see something gaming related. I'm curious as to why we haven't seen more of neural networks applied to games even in a completely experimental fashion; we used to have a lot of little experimental indie games such as Façade (2005) and I'm surprised we don't have something similar years after the advent of LLMs.

We could have mods for old games that generate voices for the characters for example. Maybe it's unfeasible from a computing perspective? There are people running local LLMs, no?

raincole
0 replies
12h34m

We could have mods for old games that generate voices for the characters for example

You mean in real time? Or just in general?

There are a lot of mods that use AI-generated voices. I'll say it's the norm of modding community now.

dabochen
1 replies
4h59m

So there is no interactivity, but the generated content is not the exact view in the training data, is this the correct understanding?

If so, is it more like imagination/hallucination rather than rendering?

og_kalu
0 replies
2h52m

It's conditioned on previous frames AND player actions so it's interactive.

broast
1 replies
13h15m

Maybe one day this will be how operating systems work.

misterflibble
0 replies
10h28m

Don't give them ideas lol terrifying stuff if that happens!

wantsanagent
0 replies
4h57m

Anyone have reliable numbers on the file sizes here? Doom.exe from my searches was around 715 KB, and with all assets somewhere around 10 MB. The SD 1.4 files are over 2 GB, so we're likely looking at roughly a 200x increase (against the full game) up to a ~2,800x increase (against just the executable), depending on whether you think of this as an 'engine' or the full game.

troupo
0 replies
11h33m

Key: "predicts next frame, recreates classic Doom". A game that was analyzed and documented to death. And the training included uncountable runs of Doom.

A game engine lets you create a new game, not predict the next frame of an existing and copiously documented one.

This is not a game engine.

Creating a new good game? Good luck with that.

throwthrowuknow
0 replies
0m

Several thoughts for future work:

1. Continue training on all of the games that used the Doom engine to see if it is capable of creating new graphics, enemies, weapons, etc. I think you would need to embed more details for this, perhaps information about what is present in the current level, so that you could prompt it to produce a new level from some combination.

2. Could embedding information from the map view or a raytrace of the surroundings of the player position help with consistency? I suppose the model would need to predict this information as the neural simulation progressed.

3. Can this technique be applied to generating videos with consistent subjects and environments by training on a camera view of a 3D scene and embedding the camera position and the position and animation states of objects and avatars within the scene?

4. What would the result of training on a variety of game engines and games with different mechanics and inputs be? The space of possible actions is limited by the available keys on a keyboard or buttons on a controller but the labelling of the characteristics of each game may prove a challenge if you wanted to be able to prompt for specific details.

throwmeaway222
0 replies
14h27m

You know how when you're dreaming and you walk into a room at your house and you're suddenly naked at school?

I'm convinced this is the code that gives Data (ST TNG) his dreaming capabilities.

thegabriele
0 replies
10h12m

Wow, I bet Boston Dynamics and such are quite interested

t1c
0 replies
6h59m

They got DOOM running on a diffusion engine before GTA 6

seydor
0 replies
7h8m

I wonder how far it is from this to generating language reasoning about the game from the game itself, rather than learning a large corpus of language, like LLMs do. That would be a true grounded language generator

rrnechmech
0 replies
2h48m

To mitigate auto-regressive drift during inference, we corrupt context frames by adding Gaussian noise to encoded frames during training. This allows the network to correct information sampled in previous frames, and we found it to be critical for preserving visual stability over long time periods.

I get this (mostly). But would any kind soul care to elaborate on this? What is this "drift" they are trying to avoid and how does (AFAIU) adding noise help?
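For reference, my mental picture of the corruption step itself is roughly this (a rough sketch, not their code):

  import torch

  def corrupt_context(context_latents, max_sigma=0.7):
      # add a random amount of Gaussian noise to the *context* frames only,
      # so the training inputs resemble the slightly-off frames the model
      # will be feeding back to itself at inference time
      sigma = torch.rand(context_latents.shape[0], 1, 1, 1) * max_sigma
      return context_latents + sigma * torch.randn_like(context_latents)

...but I'm still fuzzy on exactly why that stops errors from compounding over long rollouts.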

ravetcofx
0 replies
14h52m

There is going to be a flood of these dreamlike "games" in the next few years. This feels likes a bit of a breakthrough in the engineering of these systems.

piperswe
0 replies
13h53m

This is honestly the most impressive ML project I've seen since... probably O.G. DALL-E? Feels like a gem in a sea of AI shit.

nuz
0 replies
8h28m

I wonder how overfit it is though. You could fit a lot of Doom-resolution JPEG frames into 4 GB (the size of SD 1.4).

lukol
0 replies
11h2m

I believe future game engines will be state machines with deterministic algorithms that can be reproduced at any time. However, rendering said state into visual / auditory / etc. experiences will be taken over by AI models.

This will also allow players to easily customize what they experience without changing the core game loop.

lackoftactics
0 replies
6h21m

I think Alan's conservative countdown to AGI will need to be updated after this. https://lifearchitect.ai/agi/ This is really impressive stuff. I thought about it a couple of months ago, that probably this is the next modality worth exploring for data, but didn't imagine it would come so fast. On the other side, the amount of compute required is crazy.

kqr
0 replies
7h39m

I have been kind of "meh" about the recent AI hype, but this is seriously impressive.

Of course, we're clearly looking at complete nonsense generated by something that does not understand what it is doing – yet, it is astonishingly sensible nonsense given the type of information it is working from. I had no idea the state of the art was capable of this.

kcaj
0 replies
13h55m

Take a bunch of videos of the real world and calculate the differential camera motion with optical flow or feature tracking. Call this the video’s control input. Now we can play SORA.

jumploops
0 replies
9h58m

This seems similar to how we use LLMs to generate code: generate, run, fix, generate.

Instead of working through a game, it’s building generic UI components and using common abstractions.

joseferben
0 replies
6h15m

Impressive. Imagine this but photorealistic, with VR goggles.

itomato
0 replies
8h53m

The gibs are a dead giveaway

holoduke
0 replies
9h50m

I saw a video a while ago where they recreated actual Doom footage with a diffusion technique so it looked like a jungle or anything you liked. Can't find it anymore, but it looked impressive.

golol
0 replies
10h16m

Certain categories of youtube videos can also be viewed as some sort of game where the actions are the audio/transcript advanced a couple of seconds. Add two eggs. Fetch the ball. I'm walking in the park.

dean2432
0 replies
14h12m

So in the future we can play FPS games given any setting? Pog

ciroduran
0 replies
8h19m

Congrats on running Doom on a Diffusion Model :D

I was really entranced by how combat is rendered (the grunt doing weird stuff in very much the style that the model generates images). Now I'd like to see this implemented in a shader in a game

bufferoverflow
0 replies
13h47m

That's probably how our reality is rendered.

amunozo
0 replies
8h59m

This is amazing and an interesting discovery. It is a pity that I don't find it capable of creating anything new.

amelius
0 replies
9h26m

Yes, and you can use an LLM to simulate role playing games.

alkonaut
0 replies
5h3m

The job of the game engine is also to render the world given only the world's properties (textures, geometries, physics rules, ...), and not given "training data that had to be supplied from an already written engine".

I'm guessing that the "This door requires a blue key" doesn't mean that the user can run around, the engine dreams up a blue key in some other corner of the map, and the user can then return to the door and the engine now opens the door? THAT would be impressive. It's interesting to think that all that would be required for that task to go from really hard to quite doable, would be that the door requiring the blue key is blue, and the UI showing some icon indicating the user possesses the blue key. Without that, it becomes (old) hidden state.

aghilmort
0 replies
3h21m

looking forward to &/or wondering about overlap with notion of ray tracing LLMs

acoye
0 replies
7h39m

Nvidia's CEO reckons your GPU will be replaced with AI in "5-10 years". So I guess this is sort of the first working game of that kind.

acoye
0 replies
7h38m

I'd love to see John Carmack come back from his AGI hiatus and advance AI-based rendering. This would be super cool.

YeGoblynQueenne
0 replies
4h54m

Misleading Titles Are Everywhere These Days.

TheRealPomax
0 replies
3h18m

If by "game" you mean "literal hallucination" then yes. But if we're not trying to click-bait, then no: it's not really a game when there is no permanence or determinism to be found anywhere. It might be a "game-flavoured dream simulator", but it's absolutely not a game engine.

LtdJorge
0 replies
8h0m

So is it taking inputs from a player and simulating the gameplay or is it just simulating everything (effectively, a generated video)?

KETpXDDzR
0 replies
3h25m

I think the correct title should be "Diffusion Models Are Fake Real-Time Game Engines". I don't think just more training will ever be sufficient to create a complete game engine. It would need to "understand" what it's doing.