I think people might be missing what this enables. It can make plausible continuations of video, with realistic physics. What happens if this gets fast enough to work _in real time_?
Connect this to a robot that has a real-time camera feed. Have it constantly generate potential future continuations of the feed it's getting -- maybe more than one. You have an autonomous robot building a real-time model of the world around it and predicting the future. Give it some error correction based on how well each prediction models the actual outcome and I think you're _really_ close to AGI.
You can probably already imagine different ways to wire the output to text generation, to controlling its own motions, to predicting outcomes of actions it could plausibly take and choosing the best one.
It doesn't actually have to generate realistic imagery or imagery that doesn't have any mistakes or imagery that's high definition to be used in that way. How realistic is our own imagination of the world?
Edit: I'm going to add a specific case. Imagine a house-cleaning robot. It starts with an image of your living room. Then it creates an image of your living room after it's been cleaned. Then it interpolates a video _imagining itself cleaning the room_, then acts as much as it can to mimic what's in the video, then generates a new continuation, then acts, and so on. Imagine doing that several times a second, if necessary.
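Something like this loop, to make it concrete (every interface here is hypothetical -- no such API exists today, it's just the shape of the idea):

```python
# Hypothetical predict-act loop for the house-cleaning example.
# `robot` and `video_model` stand in for a future fast video model
# and a robot controller; none of these methods exist anywhere.

def cleaning_loop(robot, video_model, done):
    while not done():
        current = robot.camera_frame()                 # real-time feed
        goal = video_model.imagine_goal(current)       # e.g. "this room, but clean"
        plan = video_model.interpolate(current, goal)  # imagined video of the cleanup
        robot.mimic(plan[:10])                         # act out the first few frames
        # then loop: re-observe, re-imagine, re-act -- several times a second
```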
Sounds like simulation theory is closer and closer to being proven.
Except there is always an original at the root. There’s no way to prove that’s not us.
The root world can spawn many simulations and simulations can be spawned within simulations. It becomes far more likely that we exist in a simulated world than in the root world.
This argument always bothers me.
The probability of being in any given simulation is conditional on the level above it, so it necessarily decreases exponentially with depth. Any simulation running an equivalent simulation will do so much slower, so you get a geometric series of degrading probabilities.
The rate of decay will be massive: imagine how long it would take, and how many resources it would need, for us to simulate our universe, even in a hand-waved AI-plus-lazy-compute way that also spawns sub-consciousnesses. The inverse of that is the ratio of the series.
So even in theory, the probability of you being in any of the simulated universes is roughly P(we can simulate a universe) / (1 − 1/(time to simulate)) − P(we're in the top universe).
So the idea that nesting makes this probability overwhelming doesn't hold.
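To see it numerically, here's the back-of-the-envelope version in Python (the numbers are made up, only the shape of the series matters):

```python
# Toy version of the nesting argument: each level simulates the next
# at slowdown factor T, so the weight of level n decays like (1/T)**n.
p_sim = 0.1   # assumed P(a universe can and does spawn a simulation)
T = 1e6       # assumed slowdown per nesting level

# Geometric series: sum over levels n >= 1 of p_sim * (1/T)**(n-1)
total_simulated_weight = p_sim / (1 - 1 / T)
print(total_simulated_weight)  # ~0.1000001: nesting adds almost nothing
```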
That's not a good argument, because we have no way of knowing the ratio between our time and time in the universe where our simulation runs. Even if it takes hours to generate one second of our universe, we only experience our own time.
Besides, time is not absolute, and having it run slower near massive objects or when objects accelerate would be a neat trick to save on compute power needed to simulate a universe.
I wouldn’t have used the slowdown argument but rather the information-encoding argument. A simulation must necessarily encode less information than the universe being simulated, in fact substantially less. It wouldn’t be possible to encode the same information as the root universe in any nested simulation, even at the first level: that would require encoding the entire root universe. This information-encoding problem is what gets geometrically worse as you nest. At a certain point the simulation must be so simplified and lossy that it carries no meaningful information and isn’t representative of anything.
However - just because the probabilistic reasoning explodes and makes something look overwhelmingly likely doesn’t make it true.
First, the prior assumption is that the universe can be simulated in any meaningful way at any substantial scale. It’s not at all obvious that simulation can reach high enough fidelity to produce the complexity we see around us, short of some higher-dimensional universe simulating what we see, with the realities we observe achieved through dimensional reduction and absurdly powerful technology. This is also a term in the conditional probability, and I would not put it at 1.0. I would actually make it quite small, and as a prior its effect will be significant.
Second, there’s the prior that whatever root universe exists has yet to achieve the simulation within the flow of time, assuming time started at some discrete point. Our observations lead us to conclude time and space both emerged at a discrete point. The coalescence of the modern universe, the evolution of life itself, the emergence of intelligent beings, the technology required to simulate an enormous, highly complex universe in its entirety, etc., are all priors. These are non-trivial factors and they greatly reduce the likelihood of the simulation theory.
Third, it’s possible the clock rate of the simulation is fast enough that it runs much faster than time evolves in the root universe, but to the original post’s point, without enormous lossy optimizations the nested universes can’t run at a faster clock rate than the first-level simulation. This is partially related to the information-encoding problem, but not directly. I don’t agree it gets geometrically worse, but it doesn’t get better either without greatly reducing the quality of the simulation further. That means either the quality converges to zero very fast, or the layers run in lockstep with the first-level universe, requiring 1:1 time. Assuming it actually simulates the universe, and not just some sort of occlusion scoped to you as an individual, that might mean it takes billions of years within the first-level simulation for each layer of nesting. This seems practically unlikely even in a simulated universe, so either those layers never achieve a nesting, or they quickly converge to simulations that have lost so much fidelity they simulate nothing.
Well, first, you imply a base universe is finite. That is not a given at all.
You don't need to simulate the full universe. Just the experience of consciousness inside it. You don't even have to simulate full consciousness for every 'conscious' being. In fact, I've always seen the simulation argument as a thought experiment arguing for consciousness being more fundamental than matter. There is no need to imagine a human made computer simulating an entire universe in subatomic detail for this thought experiment to intrigue us.
Us being able to pinpoint a start of all time is actually a pretty good argument for it being simulated. Why would we be able to calculate a 'start time' for reality? It's not obvious that a base universe needs one at all. There are theoretical cosmologies out there that conceptualize a universe without that need.
The simulated universe doesn't have to run time faster than 'real time' in the base universe at all. In fact, running slower would be a feature if the beings in the base universe wished to escape into the simulation for whatever reason.
The only thing about the branching simulations is that they are likely simplified approximations. There’s no reason the nesting can’t continue, and no way for an approximation to observe that the level above it is strictly more complex than anything observable from inside its own simulation. That should be fundamentally impossible, meaning no branch can know whether it’s the root or a branch, only that it can create branches.
Haven't you heard of "turtles all the way down"?
Something something linked list with loop.
Our ability to build somewhat convincing simulations of things has never been proof of living in a simulation…
I mean, everyone's mind builds a convincing internal simulation of reality, and it's so good that most people think they're directly experiencing reality.
So what happens to someone suffering a psychotic episode, their reality gets distorted? But distorted relative to what reality, if it's all an internal simulation? I think there's an internal simulation of some aspects of reality, but there's a lot more to it.
The world model is not the world. It's the old map and territory thing.
Buddhist insight meditation actually proves that’s not true, fwiw.
In theory, yes. The problem is we've had AGI many times before, in theory. For example, Q-learning: feed the state of any game or system through a neural network, have it predict possible future rewards, iteratively improve the accuracy of the reward predictions, and boom, eventually you arrive at the optimal behavior for any system. We've known this since the late '80s at least (Watkins' Q-learning, 1989), though the underlying ideas go back further.
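For the record, the core of tabular Q-learning really is a one-line update rule. A minimal sketch, assuming a gym-style environment:

```python
import random
from collections import defaultdict

# Tabular Q-learning: iteratively improve predicted future reward Q(s, a).
Q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.99, 0.1  # learning rate, discount, exploration

def step(env, state, actions):
    # epsilon-greedy action selection
    if random.random() < eps:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    next_state, reward, done = env.step(action)  # assumes a gym-like env
    # the Q-learning update: nudge Q toward reward + discounted best future
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return next_state, done
```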
I like to do experiments with reinforcement learning and it's always exciting to think "once I turn this thing on, it's going to work well and find lots of neat solutions to the problem", and the thing is, it's true, that might happen, but usually it doesn't. Usually I see some signs of learning, but it fails to come up with anything spectacular.
I keep watching for a strong AI in a video game like Civilization as a sign that AI can solve problems in a highly complex system while being practical enough for game creators to actually ship it. Yes, maybe, maybe, a team of experts could solve Civilization as a research project, but that's far from practical. Do you think we'll be able to show an AI a video of people playing Civilization and have the video model predict the best moves before the AI built into the game can?
I’ve been dying for someone to make a Civilization AI.
It might not be too crazy of an idea - would love to see a model fine-tuned on sequences of moves.
The biggest limitation of video game AI currently is not theory, but hardware. Once home compute doubles a few more times, we’ll all be running GPT-4 locally and a competent Civilization AI starts to look realistic.
I am 100% certain that the training of such an AI will result in winning a game without ever building a single city* and 1,000 other exploits before being nerfbatted enough to play a 'real' game.
(That doesn't mean I don't want to see the ridiculousness it comes up with!)
* https://www.youtube.com/watch?v=6CZEEvZqJC0
I knew it, I knew it! It would be a Spiffing Brit video.
That guy is a genius at finding exploits in computer games. I don't know how he does it, I think you need to play a fair bit of each game before you find these little corners of the ruleset.
Idk maybe he uses some sort of fuzzer
But wouldn't this be amazing for the developer to fix a lot of edge cases/bugs?
Maybe, maybe not. The stochastic, black-box nature of the current wave of ML systems gives me a gut feeling that using them like this is more of a Monkey's Paw wish granter than useful tool without a lot of refinement first. Time will tell!
If you train the model purely based on win rate, sure. Fortunately, we can efficiently use RLHF to train a model to play in a human-like way and give entertaining matches.
I think it's also a matter of "shape". Like, GPT4 solves one "shape" of problem, given tokens, predict the next token. That's all it does, that's the only problem it has to solve.
A Civilization AI would have many problem "shapes". What do I research? Where do I build my city, what buildings do I build, how do I move my units, what units do I build, what improvements do I build, when do I declare war, what trade deals do I accept, etc, etc. Each of those is fundamentally different, and you can maybe come up with a scheme to make them all into the same "shape", but then that ends up being harder to train. I would be interested to see a good solution to this problem.
You can constrain LLMs (like LLaMA) to only output tokens that match some schema (e.g. valid code syntax).
I don't see why you couldn't get an LLM to output something like "research tech332; build city3 building24".
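The usual trick is masking the logits so only schema-valid tokens can be sampled at each step. A rough sketch (the grammar tracker that produces `allowed_token_ids` is left as an assumption):

```python
import torch

def constrained_next_token(logits, allowed_token_ids):
    """Sample the next token from `logits`, restricted to `allowed_token_ids`.

    `allowed_token_ids` would come from a grammar/schema tracker, e.g. one
    that only permits tokens continuing "research tech<N>; build city<N> ...".
    """
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0          # leave valid tokens untouched
    probs = torch.softmax(logits + mask, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```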
Would love to see someone to make an AI that can predict our economy, perhaps by modeling all the actors that participate in the economy using AI agents.
Tbh I don't think an AI for Civ would be that impressive. My experience is that most of the time you can get away with making locally optimal decisions, i.e. growing your economy and fighting weaker states. The problem with the current Civ AI is that its economies are often structured nonsensically, but optimizing an economy is usually just a matter of stacking bonuses together into specialized production zones, which can be solved via conventional algorithms.
Maybe, but a lot of people would like better AIs in strategy video games, it only adds to the frustration when people say "it wouldn't be that impressive". It's like saying "that would be easy... but it's not going to happen." (And I'm not focused on Civilization, it's just a well known example, I'd like to see a strong AI in any similar strategy game.)
I think it might be harder than StarCraft or Dota. Civilization is all about slow decision making (no APM advantages for the AI), and all the decisions are quite different, and you have to make them in a competitive environment where an opponent can raid and capture your cities.
The problem with game AI is that it "cheats". It doesn't play like a human. The Civ AI straight up gets extra resources; AlphaStar in SC2 performed inhuman feats like giving commands in two different areas of the map simultaneously, or briefly spiking actions per minute to inhuman levels. But even with all of that, the AI still eventually loses. And then it starts losing consistently as players play more against it.
Why? Because AI doesn't learn on the fly. The AI does things a certain way and beating it becomes a puzzle game. It doesn't feel like playing against a human opponent (although AlphaStar in SC2 probably came pretty close).
Learning on the fly is probably the biggest thing that (game) AI is lacking in. I'm not sure there's an easy solution for it.
And even if it succeeds, it fails again as soon as you change the environment because RL doesn't generalise. At all. It's kind of shocking to be honest.
https://robertkirk.github.io/2022/01/17/generalisation-in-re...
You're talking about an agent with a world model used for planning. Actually generating realistic images is not really needed as the world model operates in its own compressed abstraction.
Check out V-Jepa for such a system: https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...
V-Jepa is actually super impressive. I have nothing but respect for Yann LeCun & his team, they really have been on a rampage lately.
One more French brain we didn't manage to keep.
The drain is just crazy at this point.
What would fix the issue?
Based on talking to my coworkers from Europe on the West Coast (some have more nuanced positions, but some were outright "everyone in tech in their right mind moves away from Europe"), nothing short-term.
If you forget specifics, and consider at an abstract level what the differences are... Let's say there was an equal pile of "resources" per person available in Europe and the US. The way this pile is (abstractly) distributed in Europe is egalitarian and safety-net focused; in the US it is distributed more unequally, closer to some imperfect approximation of merit. Most of the (real) advantages and disadvantages people bring up for the US stem from that. The more of this approximation of merit you have, the more "resources" you'd have in the US. No matter what the specific slopes are, unless one place is much richer (might be the US anyway), at some point these lines cross. The higher a person is above this point, the more it makes sense for them to go to the US...
There are also 2nd order effects like other people above that point having already gone (not just from Europe, from everywhere in the world), making the US more attractive, probably. That might matter more for top talent.
And although this probably doesn't matter for the top talent, "regular" Europeans can actually have the cake and eat it too - make the money in the US, then (in old age or if something happens) move back home and avail themselves of the welfare state. A non-German guy who worked in Germany for a few years told me that's what he'd do if he was German - working in Germany sucks, but being lazy in Germany is wonderful, so he'd move to the US then move back ;)
Is it an issue that needs fixing?
Not sure.
It's a lot of things at this point:
- If you are skilled, the money is much better in the US, even after paying for health care yourself.
- French entities and investors are risk-averse. This means your original projects will get canned more often, funding is going to be super hard, and success will bring in less money.
- The French-speaking market is smaller, so whatever you try, if you try it in France, either you target a market outside of France, which is harder than being where your clients are, or you target France and French-speaking countries, with a much lower payoff.
- Customer and worker protection is stronger, and laws are everywhere. This is usually good for citizens, but of course it also means Uber or Airbnb could never have __started__ in France.
- The network effect means that if you go to the US, you will meet more opportunities, more skilled people and more interesting projects. There is also an energy there you won't find elsewhere.
- Administration is heavy. For companies it's of course a burden, but for universities it's a nightmare, and academics are really underpaid. Not to mention academics in France have a hard time promoting an idea, an innovation or anything they came up with, while in the US things have a catchy name before they are even proven to work.
All those things mean the US is professionally highly attractive, and it actively courts talent, with the resources to pay for it and the pressure of its market pushing it to do so.
Who is "we"?
Us.
The people of France presumably.
Do you have a list?
I don't, I just seem to have this moment of "oh, him as well" regularly.
And I get it, I went to the Valley as well for some time; the money is better, the taxes are lower, you get more opportunities, you meet more talented people, and the projects are way cooler.
What I find interesting is that because we have so much video data, we have this thing that can project the future in 2D pixel space.
Projecting into the future in 3D world space is actually the endgame for robotics, and I imagine, depending on how complex that 3D world model is, a working model for projecting into 3D space could be waaaaaay smaller.
It's just that the equivalent data is not as easily available on the internet :)
That's what estimation and simulation are for. Obviously that's not what's happening in TFA, but it's perfectly plausible today.
Not sure how people are concluding that realistic physics is feasible operating solely in pixel space, because it obviously isn't, and anyone with any experience training such models would instantly recognize the local optimum these demos represent. The point of inductive bias is to make the loss function as convex as possible by inducing a parametrization that is "natural" to the system being modeled. Physics is exactly the attempt to formalize such a model borne of human cognitive faculties, and it's hard to imagine that you can do better with less fidelity by just throwing more parameters and data at the problem, especially when the parametrization is so incongruent with the inherent dynamics at play.
There are also models that are trained to generate 3D models from a picture. Use it on videos, and also train it on output generated by video games.
Depth estimation has improved a lot as well, e.g. with Depth-Anything [0], but those models mostly produce relative depth instead of metric depth. Even when converted to metric, they still seem to produce a lot of stray points at the edges that have to be pruned - visible in this blog [1]. It looks like models trained on lidar or stereo depth maps have these limitations. I think we don't have enough clean training data for 3D, unless we train on synthetic data (then we can have plenty: generate realistic scenes in Unreal Engine 5 and train on rendered 2D frames).
[0] https://github.com/LiheYoung/Depth-Anything
[1] https://medium.com/@patriciogv/the-state-of-the-art-of-depth...
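If anyone wants to poke at this, relative depth is a few lines with the Hugging Face depth-estimation pipeline (the model id here is taken from the Depth-Anything repo, so treat it as an assumption):

```python
from transformers import pipeline
from PIL import Image

# Relative (not metric) depth, per the limitation noted above.
depth = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")
result = depth(Image.open("frame.png"))
result["depth"].save("frame_depth.png")  # grayscale relative depth map
```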
Imagine it going a few dimensions further: what will happen when I tell this person 'this'? How will it affect the social graph and my world state :)
A 3d model with object permanency is definitely a step in the right direction of something or other but for clarity let us dial back down the level of graphical detail.
A Pacman bot is not AGI. It might eat all the dots correctly, whereas before, if something scrolled off the screen, it'd forget about it and glitch out - but you haven't fanned any flames of consciousness into existence yet.
Is a human that manages to eat all the skittles and walk without falling into deadly holes AGI? Why?
Object permanence is a necessary but not sufficient condition for spatial reasoning but the definition of consciousness remains elusive unless you have some news to share.
A blind human is AGI. So is a drunk or clumsy one that falls into deadly holes. This is super cool and a step on the way to... something.. but even the authors don't claim it is somehow the whole ballgame.
I totally agree that a system like Sora is needed. By itself, it’s insufficient. With a multimodal model that can reason properly, then we get AGI or rather ASI (artificial super intelligence) due to many advantages over humans such as context length, access to additional sensory modalities (infrared, electroreception, etc), much broader expertise, huge bandwidth, etc.
future successor to Sora + likely successor to GPT-4 = ASI
See my other comment here: https://news.ycombinator.com/item?id=39391971
I call bullshit.
A key element of anything that can be classified as "general intelligence" is developing internally consistent and self-contained agency, and then being able to act on that. Today we have absolutely no idea of how to do this in AI. Even the tiniest of worms and insects demonstrate capabilities several orders of magnitude beyond what our largest AIs can.
We are about as close to AGI as James Watt was to nuclear fusion.
A definition of general intelligence may or may not include agency to act. There is no consensus on that. To learn and to predict, yes, but not necessarily to act.
Does someone with Locked-In Syndrome (LIS) continue to be intelligent? I’d say yes.
Obviously, agency to act might be instrumental for learning and predicting especially early in the life of an AI or a human, but beyond a certain point, internal simulations could substitute for that.
This comment is brilliant. Thank you. I'm so excited now to build a bot that uses predictive video. I wonder what the simplest prototype would be? Surely one with a simple validation loop that can say: hey, this predicted video became true. Perhaps a 2D infinite-scrolling video game?
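The validation loop could be as dumb as a frame difference between what the model imagined and what the camera then saw. A sketch:

```python
import numpy as np

def prediction_error(predicted_frames, actual_frames):
    """Mean-squared error between imagined and observed frames.

    Low error means "this predicted video became true"; the score could
    gate which continuation the bot commits to next.
    """
    pred = np.asarray(predicted_frames, dtype=np.float32)
    real = np.asarray(actual_frames, dtype=np.float32)
    return float(np.mean((pred - real) ** 2))
```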
Imagine having real-time transfer of characteristics within your world in a VR/mixed reality setup. Automatically generating new views within the environment you are currently in could create pretty interesting experiences.
Imagine putting on some AR goggles
staring at a painting in a Museum
Then immediately jumping into an entire VR world based off the painting generated by an AI rendering it out on the fly
BlockadeLabs has been doing 3D text-to-skybox; it's not exactly runtime at the moment, but I have seen it work in a headset and it definitely feels like the future.
That’s how we think:
Imagine where you want to be (e.g., "I scored a goal!") from where you are now, visualize how you'll get there (e.g., a trick and then a shot), then do that.
There was an article a few months ago about how that's basically what the cerebellum does.
So basically a brain in a vat, reality as we experience it, our thoughts as prompts.
Figure out how to incorporate a quantum computer as a prediction engine in this idea, and you've got quite the robot on your hands. :)
(and throw this in for good measure https://www.wired.com/story/this-lab-grown-skin-could-revolu... heh)
This sounds like it has military applications, not that I’m excited at the prospect.
FWIW, you've basically described at a high level exactly what autonomous driving systems have been doing for several years. I don't think anyone would say that Waymo's cars are really close to AGI.
Adding to this: Sora was most likely trained on video that's more like what you'd normally see on YouTube or in a clip art or media licensing company collection. Basically, video designed to look good as a part of a film or similar production.
So right now, Sora is predicting "Hollywood style" content, with cuts, camera motions, etc... all much like what you'd expect to see in an edited film.
Nothing stops someone (including OpenAI) from training the same architecture with "real world captures".
Imagine telling a bunch of warehouse workers that for "safety" they all need to wear a GoPro-like action camera on their helmets that record everything inside the work area. Run that in a bunch of warehouses with varying sizes, content, and forklifts, and then pump all of that through this architecture to train it. Include the instructions given to the staff from the ERP system as well as the transcribed audio as the text prompt.
Ta-da.
You have yourself an AI that can control a robot using the same action camera as its vision input. It will be able to follow instructions from the ERP, listen to spoken instructions, and even respond with a natural voice. It'll even be able to handle scenarios such as spills, breaks, or other accidents... just like the humans in its training data did. This is basically what vehicle auto-pilots do, but on steroids.
Sure, the computer power required for this is outrageously expensive right now, but give it ten to twenty years and... no more manual labour.
The flip side of video or image generation is always video or image identification. If video gets really good, then an AI can have quite an accurate visual view into the world in real time.
how would you define AGI?
Thanks for adding the specific case. I think with testing, these sorts of limited-domain applications make sense.
It'll be much harder for more open-ended world problems, where the physics encountered may be rare enough in the dataset that the simulation breaks unexpectedly. For example, a glass smashing on the floor: the model doesn't simulate that causally, afaik.
As another comment points out that's Yann LeCun's idea of "Objective-Driven AI" introduced in [1] though not named that in the paper (LeCun has named it that way in talks and slides). LeCun has also said that this won't be achieved with generative models. So, either 1 out of 2 right, or both wrong, one way or another.
For me, I've been in AI long enough to remember many such breakthroughs that would lead to AGI before - from DeepBlue (actually) to CNNs, to Deep RL, to LLMs just now, etc. Either all those were not the breakthroughs people thought at the time, or it takes many more than an engineering breakthrough to get to AGI, otherwise it's hard to explain why the field keeps losing its mind about the Next Big Thing and then forgetting about it a few years later, when the Next Next Big Thing comes around.
But, enough with my cynicism. You think that idea can work? Try it out. In a simplified environment. Take some stupid grid world, a simplification of a text-based game like Nethack [2] and try to implement your idea, in-vitro, as it were. See how well it works. You could write a paper about it.
____________________
[1] https://openreview.net/pdf?id=BZ5a1r-kVsf
[2] Obviously don't start with Nethack itself because that's damn hard for "AI".