
Lumiere: A space-time diffusion model for realistic video generation

55555
23 replies
2d6h

The negative comments here shock me. This is the most amazing text-to-video we've ever seen, by a long shot. It's good enough for many uses. Absolutely mind-blowing. Great job to the people who worked on this.

danielbln
9 replies
2d6h

I'm excited about these text-to-video models; what I'm not excited about is that it's Google publishing this. That means no code, nothing deployed to try, and most likely we will never hear about this ever again, or maybe it quietly hits Vertex AI in 2 years (like Imagen) and no one will care.

Also, ever since the Gemini marketing video shenanigans, I don't really feel like trusting whatever Google's research says they have, if I can't test it myself.

sjwhevvvvvsj
4 replies
2d2h

“PoC or GTFO” as they say.

addandsubtract
3 replies
2d1h

PapersWithCode or GTFO.

[0] https://paperswithcode.com/

baldgeek
2 replies
2d1h

Two clicks from the posted link: "Read Paper", then the "Code, Data and Media" tab, will get you the dataset used (https://paperswithcode.com/dataset/ucf101).

sjwhevvvvvsj
0 replies
1d23h

Well, in the AI/ML era maybe “models or GTFO” is better. Training data is just Common Crawl for half these LMs.

gs17
0 replies
23h45m

That's not the dataset used for training. From the paper:

We train our T2V model on a dataset containing 30M videos along with their text caption. [...] We evaluate our model on a collection of 113 text prompts describing diverse objects and scenes. The prompt list consists of 18 prompts assembled by us and 95 prompts used by prior works (Singer et al., 2022; Ho et al., 2022a; Blattmann et al., 2023b) (see App. B). Additionally, we employ a zero-shot evaluation protocol on the UCF101 dataset [...]
sho_hn
3 replies
2d6h

Also, ever since the Gemini marketing video shenanigans, I don't really feel like trusting whatever Google's research says they have, if I can't test it myself.

The video was released by Google product marketing for a launch to customers, not by research.

I'm still somewhat confused by this one. I understand the community has decided to be harsh on Google for that video to draw a line (fair enough: truth in advertising, etc.), but at the same time, we all had an understanding of where that tech currently is and the pace it progresses at. Did anyone watching it really assume it was realtime? Can we not differentiate between technical publications and marketing anymore? Do we have to vilify everyone in an R&D department for the sins of the product marketing wing?

whywhywhywhy
1 replies
2d5h

We’re all harsh on it because we all had to see it being posted around by naive people as amazing, when it's completely faked and omits half of the prompting and all of the latency.

It was completely dishonest. Considering how trash Google's actual AI products are, they deserve to be dragged even more over that video.

IshKebab
0 replies
1d21h

How is it completely faked? The video didn't give me the impression that the results were calculated instantly, or that no prompts were required.

ImprobableTruth
0 replies
2d5h

The issue isn't that it's not real-time/sped-up, it's that it doesn't actually take video as input, but multiple hand-picked stills.

whywhywhywhy
5 replies
2d5h

How can you get excited about a company that, during the 12 or so years they've been telling us about their work, has shipped few enough pieces of research that you can count them on one hand?

Google can publish whatever research they want; it literally doesn't matter and changes nothing, because they can't turn it into a product anyone can use, and never will.

ryandvm
2 replies
2d4h

Indeed. These recent AI demos are pretty damn impressive (even knowing there's smoke and mirrors), but it's hard to get excited about what's happening with their R&D when my Google Home device seems to be regressing on a daily basis. It is now basically only useful for alarms and timers.

lbeltrame
1 replies
2d2h

Perhaps OT, but I often see these comments on HN. How do these devices (I don't own one) lose functionality over time? Features removed through updates?

pbronez
0 replies
2d2h

Yes.

The home assistant speakers aren’t making enough money to justify the large teams behind them. Thus we’ve seen significant layoffs on those teams in the past year.

BigCos are looking for other ways to reduce costs. Killing features is one way to do it.

There have also been situations where a feature is removed because of legal action: lawsuits alleging the feature violates a patent.

Live updates giveth, live updates taketh away!

smoldesu
0 replies
2d2h

I'm excited because I despise Google's products anyways and would rather use the research myself. Did that with Google's BERT model a few years back to make a particularly clueless Discord bot.

sjwhevvvvvsj
0 replies
2d2h

Hey now, Google is going to use these technologies to fire their own employees to save costs for the next quarterly earnings call!

Of course building products for actual users is no longer a “thing”, but think of the stock price.

boesboes
1 replies
2d6h

What negative comments?

saurik
0 replies
2d5h

Yeah... I have only found a single negative comment about the quality--"As realistic as my blurry dream."--and it comes across as more of a cynical joke than a true negative review.

tomcam
0 replies
2d4h

Agreed. And the dancing bear is an instant classic.

malka
0 replies
2d1h

It's by Google. It will rot somewhere, never to be used.

heisgone
0 replies
2d1h

It's indeed impressive. Stable Diffusion is progressing so fast. That being said, I find myself picking up more and more cues that an image is AI-generated. There is a feel to it. It's no different from the best movie CGI. As Christopher Nolan pointed out, no matter how good it is, it's not the real deal.

eurekin
0 replies
2d6h

Just chiming in to second your opinion.

Years ago, I wouldn't even dare to dream it would be possible. It's nowhere near what people are used to watching normally, but the fact that it's even trying to compete is insane.

endisneigh
0 replies
2d3h

Well, it's Google, and a lot of folks foam at the mouth at Google, so it's no surprise. They can't dissociate the research from the creator.

smusamashah
18 replies
2d7h

The examples are a lot more consistent and longer than in other techniques we have seen before. Legs are not sliding on the floor as much as they do with other models. On the other hand, human faces didn't look good, e.g. the Mona Lisa smiling.

To me this looks like the first good video generation model.

EDIT: Just noticed it's by Google. NVM, it will never be released publicly.

i-use-nixos-btw
12 replies
2d6h

If it were to be released publicly, I'd give it a week before NSFW models based on it were uploaded to Civitai.

Frost1x
9 replies
2d6h

A lot of current AI techniques are making people reevaluate their perspectives on free speech.

We seem to value freedom of speech (and expression) only up to a tipping point where it begins to invade other aspects of life. So far the noise level and rate have been low enough that people at large support free speech, but newer information techniques are making it possible to generate a lot more realistic noise (faux signal, if you will) at higher rates; it's becoming cheaper and easier to do and to scale.

So while you certainly have a point I mostly agree with, we're letting private entities' policies dictate the limitations of expression, at least for the time being (until someone comes along and makes these widely available for free or cheap without such ethical policies). It does go to show just how much sway industries have on markets through their policies with no public oversight, which to me is concerning.

kjqgqkejbfefn
7 replies
2d4h

I've been experimenting with story generation/RP with ChatGPT and now use jailbreaks systematically because it makes the stories so much better. It's not just about what's allowed or not, but what's expressed by default. Without jailbreaks, ChatGPT will always give narration a positive twist, not to mention inject the same sponsored themes of environmentalism and feminism. Nothing wrong with that. But I don't want a third of my stories to revolve around these thematics.

kridsdale1
1 replies
2d4h

Similarly I wanted to use it to illustrate my friend’s wizard character using a gorgon head to freeze some giant evil bees.

The OpenAI content policies are pretty strictly opposed to the holding and wielding of severed heads.

lbeltrame
0 replies
2d2h

I got lectured by Bard when I asked for help improving the description of an action scene, which involves people getting hurt (at least on the losing side), even if only marginally. I suppose you can still jailbreak ChatGPT? I didn't know it was still a thing.

devbent
1 replies
1d22h

You can easily prompt GPT to write dark stories. When asked to write in the style of Game of Thrones, GPT-3.5 will happily write about people doing horrible things to each other.

Without jailbreaks ChatGPT will always give narration a positive twist

Most modern stories in Western literature have a positive twist. It is only natural that GPT's output will reflect that!

DrSiemer
0 replies
1d10h

This behavior is a result of the additional directives, not of the training. None of the "free" LLMs display these characteristics, and jailbreaking ChatGPT would quickly revert it to its natural state of random nothing-is-sacred posts from the internet.

Example: ask ChatGPT any kind of innocent medical question, like whether aspirin will speed up healing from a cold, and tell it NOT to begin its answer by stating "I am not a medical expert" or you will kick a puppy. This works for most models, but not ChatGPT. It WILL make you kick the puppy.

I understand why they have to do things like this, but I'd really prefer the option to waive all rights to being insulted or poorly advised and just get the (mostly) raw output myself, because it does downgrade the experience quite a bit.

Fortunately we have Mixtral now.

alec_irl
1 replies
2d

Sincere question - and maybe I'm missing the point here - but why not just write stories yourself?

kjqgqkejbfefn
0 replies
1d21h

I'm trying to build a text-based open-world massively multiplayer game in the style of GTA. Trying. It's really difficult. My bet is on driving the game with narration so my prompts are fueled with abstract notions borrowed from the various theories in https://en.wikipedia.org/wiki/Narratology, and this is why I complain about ChatGPT's default ideas.

gs17
0 replies
1d23h

Nothing wrong with that.

The themes maybe, but the forced positivity is frustrating. Trying to get stock ChatGPT to run a DnD-type encounter is hilarious because it's so opposed to initiating combat.

throwuwu
0 replies
2d

I don’t see why freedom of speech would be impacted by this. Existing laws around copyright and libel will need to be applied and litigated on a case by case basis but they should cover the malicious uses. Anything that falls outside of that is just noise and we have plenty of noise already.

Even if we wind up at a point where no one trusts photos or videos, is that really a disaster? Blindly trusting a photo or video that someone else, especially some anonymous account, gives you is a terrible way to shape your perception of the world. Ensuring that fewer people default to trusting random videos may even be good for society. It would force you to think about where the video came from, whether it's corroborated by other reports from various sources, and whether you're able to verify the events through other channels available to you. You have to do the same work when evaluating any other claim, after all.

turnsout
1 replies
2d4h

Eventually it will happen—if not this model, another one. AI is going to absolutely decimate the porn industry.

whamlastxmas
0 replies
1d19h

Agreed - being able to watch a porn video and change anything on the fly is going to be wild. Bigger boobs, different eye color, speaking different language, etc.

Archelaos
2 replies
2d5h

e.g. the Mona Lisa smiling

This was not Leonardo da Vinci's "Mona Lisa"[1], but Johannes Vermeer's "Girl with a Pearl Earring"[2].

[1] https://en.wikipedia.org/wiki/Mona_Lisa

[2] https://en.wikipedia.org/wiki/Girl_with_a_Pearl_Earring

idiotsecant
1 replies
2d4h

Keep scrolling.

Archelaos
0 replies
2d3h

Ah, I see. On my screen resolution, this image was hidden in a carousel.

ithkuil
0 replies
2d6h

No, but researchers will build on this research, as researchers do, and eventually some company will ship a successful product based on the results of a lot of research that includes this, and we'll be bitching about Google falling behind.

Google is sponsoring a lot of cutting edge research and sharing it openly. How cool is that? How long will it last?

gardnr
0 replies
2d6h

I wonder how many of the samples from this demo video are authentic:

https://arstechnica.com/information-technology/2023/12/googl...

wantsanagent
16 replies
1d23h

I find it deeply offensive that this work is presented under the auspices of scientific research.

The only way to describe this is bragging, advertising, or marketing. There are no reproducible processes described. While the diagram of their architecture may inspire others, it does not allow for the most crucial aspect of the scientific endeavor: falsification.

There is no way we can know if Google is lying because there's no way to check. It should be assumed that every example has been cherry-picked and post processed. It should be assumed that the data used to train the model (if one was trained at all) was illicitly acquired. We have to start from a mindset of extreme skepticism because Google now routinely makes claims that cannot be demonstrated. When the performance of Gemini in Bard is compared to GPT-4, for example, it falls far short. When they released a video claiming to show an interaction with a model, it turned out it wasn't anything of the kind.

Ideally no organization would operate like this but Google has become a particularly egregious repeat offender.

Workaccount2
7 replies
1d22h

Just an FYI, it's not illegal to use data to train a model. It's illegal to have a model output that (identical) data for commercial gain.

This difference is purposely muddied, but important to understand.

leereeves
5 replies
1d22h

it's not illegal to use data to train a model

That's not at all settled law. AI companies are hoping to use the fair use exception to protect their businesses, but it looks like it will soon be clarified the other way.

Wired summed it up: "Congress Wants Tech Companies to Pay Up for AI Training Data"

https://www.wired.com/story/congress-senate-tech-companies-p...

And Ars wrote "Media orgs want AI firms to license content for training, and Congress is sympathetic."

https://arstechnica.com/information-technology/2024/01/at-se...

"[Senator] Hawley expressed concerns that if the tech companies' expansive interpretation of fair use prevails, it would be like "the mouse that ate the elephant"—an exception that would make copyright law toothless."

summerlight
2 replies
1d21h

Currently, neither party has strong legal ground, and it may require another landmark case to fully settle this.

leereeves
1 replies
1d21h

If Congress doesn't get there first.

summerlight
0 replies
1d21h

Even if Congress made a law, it could be effectively delayed by injunctions until the Supreme Court made the ultimate decision. And I'm pretty sure big tech will challenge it with an army of lawyers.

Workaccount2
1 replies
1d21h

Again, fair use concerns the production of copyrighted works; it has nothing to do with the training. If that were the case, every person who could draw a Batman symbol from memory would be in violation of copyright.

"Using copyrighted works for monetary gain" refers to using art itself as the product. Knowing what Apple's logo is and making a logo in that style is not a violation of copyright. However using Apple's logo (or something strikingly close) is a violation.

The reason this is muddied is because legally artists don't really have a leg to stand on for "my art cannot be trained on by a computer" whereas they do have strong legal precedent (and actual laws) for "my art cannot be reproduced by a computer".

leereeves
0 replies
1d21h

fair use concerns the production of copyrighted works, it has nothing to do with the training

Training is the "production" of a derivative work (a model) based on the training data.

AI companies claim that this is covered by fair use, but this is simply a claim that has not yet been tested in court.

And even if courts rule in favor of the AI companies, it sounds likely (based on what I've read) that Congress will soon rewrite the law to support the artists' position.

hmcq6
0 replies
1d18h

It definitely depends on where you get that data from.

You don't have the right to make a copy of an e-book and keep that file on your server/computer for the purposes of training AI. Copying that file onto your computer is in many cases already an act of copyright infringement.

summerlight
4 replies
1d21h

There is no way we can know if Google is lying because there's no way to check. It should be assumed that every example has been cherry-picked and post processed. It should be assumed that the data used to train the model (if one was trained at all) was illicitly acquired. We have to start from a mindset of extreme skepticism because Google now routinely makes claims that cannot be demonstrated.

This doesn't sound like a productive stance for science. You don't trust their result? It's fine to ignore all the claimed artifacts and just take the core idea. You don't have to assume any malice to invalidate their so-called advertisement.

While this kind of stance might make you feel a bit better, it will also make your claim political and slow you down if the work happens to be real, given the history that many of Google's papers have eventually become the foundation of other useful technologies even though almost all of them didn't contain reproducible artifacts.

gs17
1 replies
23h44m

and you can just take the core idea

That's generally easier said than done. The dataset isn't available, and there isn't really enough detail in the paper to make it replicable even if you have the resources to do it.

summerlight
0 replies
23h15m

Still, we have folks who have revolutionized the entire industry based on those Google papers, usually without enough detail. Being "hard" is usually not a good excuse.

cwkoss
1 replies
1d13h

What about this makes you refer to it as "science"?

summerlight
0 replies
23h18m

I don't want to spend any time on a philosophical debate that is destined to be inconclusive. If you want an answer, make a more specific point.

whamlastxmas
0 replies
1d19h

This video is almost certainly made mostly for Google investors: look, we aren't dying, search isn't dying, dancing bears!

That said, if this tech is as advertised, it's extremely impressive to me.

bugglebeetle
0 replies
1d23h

There is no way we can know if Google is lying because there's no way to check it.

We can gather that they are likely to be lying or cherry-picking examples to make themselves look better, since they were already caught faking an AI demo. In the world of actual research, if you got caught doing this, all your subsequent and prior work would be under severe scrutiny.

GaggiX
0 replies
1d22h

When the performance of Gemini in bard is compared to GPT-4 for example, it falls far short.

How did people get access to Gemini Ultra? Or are you talking about Gemini Pro, the one that compares to GPT-3.5?

qwertox
13 replies
2d5h

If this isn't the bell ringing, announcing that an entire industry will soon collapse, then I don't know what could announce it more clearly.

I give it 5 years until AI-generated TV/YT ads are normalized, and 10 to 15 until traditionally made ones are in the minority.

In the beginning it will be just a bunch of geeks in front of computers crafting the prompts; later everyone will be able to make them.

Access to computing resources will probably be the limiting factor.

rexreed
2 replies
2d4h

I give it 12 months until the pharmaceutical industry starts using this in a significant way. Currently, most pharma ads on TV look like stock footage of random people doing random things, with text and a voice-over. So AI-generated? Sure, as if people are even watching the video action in any detail at all in pharma ads. AI-gen video companies that focus on pharma will rake it in for sure in the short term.

[video prompt: Two elderly people taking a stroll on a boardwalk, partaking in various boardwalk activities.] [AI gen voice: Suffering from chronic blorgoriopsy? Try Neuvoplaxadip by Excelon Pharmaceuticals. Reported side effects include... Ask your doctor.]

Filligree
1 replies
2d3h

Well, that would be a US-only thing. I don't think you can build an industry on that.

Xirgil
0 replies
2d1h

What? The industry already exists. There's clearly money there. The idea that you can't have an industry just because it's specific to the richest country on earth is silly.

tetris11
1 replies
2d5h

Yep, and my cynical side is just hoping that the GPU vendors aren't going to deliberately limit the number of user-accessible resources there are to force people to depend on their cloud platforms.

__loam
0 replies
2d5h

There's probably going to be more and more specialized hardware for this stuff. Things like H100s are already pretty inaccessible to consumers.

dkjaudyeqooe
1 replies
2d4h

That sounds like a great increase in productivity.

But you're also making the mistake of extrapolating past the realities of the techniques.

Things may improve over time, but prompts and random seeds aren't great for detailed work, so there are constraints that seriously limit the usefulness. "Everyone will be able to make it" is likely true, but the specialist stuff will likely remain and those users will likely be made more productive. It's those in the middle that will lose out.

That an industry is destroyed is neither here nor there. Sucks to have your business/job taken away but that's how the system works. That which created your business also will destroy it.

wegfawefgawefg
0 replies
2d2h

Have you played with ControlNet in ComfyUI? Try it. You can pose arbitrary figures. There are gonna be full kits that provide control over every aspect of generation.

coldcode
1 replies
2d4h

OK, let's see it make full-sized videos first; making tiny demo videos is a long way from showing it at 4K. Also, let's see the entire paper and note how much computing power was required to build the models. Until everyone can try it for themselves, we have no idea how cherry-picked the examples were.

qwertox
0 replies
1d17h

TV ads are short. 20 seconds of HD could be enough, easily upscaled to 4K.

I think it might be within the realm of the possible to see 30 second videos at the end of the year.

The next step could then be infinitely long videos, with frames generated at 24 fps, as long as they are able to stick to a story and a visual style that makes sense. The story could evolve automatically from an LLM or be generated in real time by an artist, like a prompt every minute. In any case, we're not that far away from this, even if the first results will be more like trippy videos.

wegfawefgawefg
0 replies
2d2h

I've already seen AI scenes in TV ads and anime. Half the YouTube thumbnails I see are AI now. So... might not even be five years. Might be 2.

kranke155
0 replies
2d2h

It is the bell ringing. I work in CGI for advertising and this is clearly going the way of still genAI.

Single image genAI went from unusable to indistinguishable from reality in 18-24 months.

kouru225
0 replies
2d3h

I think you’re overestimating how useful this is. Just like image AI, this stuff will only be useful in combination with existing techniques.

Escapado
0 replies
2d3h

I just want to feed an LLM hunter x hunter episodes and get out new ones.

But on a more serious note, I vividly remember when GANs were the next big thing when I was in university, and the output quality and variability were laughable compared to what Midjourney and the likes can produce today (my mind was still blown back then). So I would be in no way surprised if we got to a point in the next decade where we have a "Midjourney" for video generation. So I wholeheartedly agree.

I also think the computational problem is being tackled from so many angles in the field of ML. You have Nvidia releasing absolute beasts of GPUs, some promising start-ups pushing for specialized hardware, a new paper on more optimized training methods every week, Mamba bursting onto the scene, higher-quality datasets, merging of models, framework optimizations here and there. Just the other day I think I saw a post here about locally running larger LLMs. Stable Diffusion is already available for iPhones at acceptable quality and speed (given the device's power).

What I wonder about the most, though, is whether we will get more robust orchestration of different models or multi-modal models. It's one thing to have a model which, given a text prompt, generates a short video snippet. But what if I instruct my model(s) to come up with a new ad for a sports drink and they/it does research, consolidates relevant data about the target group, comes up with a proper script for an ad, creates the ad, figures out an evaluation strategy for the ad, applies it, and eventually gives me back a "well thought out" video? And all I had to do was provide a little bit of an intro and then let the thing do its magic for an hour. I know we have LangChain and BabyAGI, but they are not as robust as they would need to be to displace a bunch of jobs just yet (but I assume they will be soon enough).

StarterPro
7 replies
2d4h

What is the point of this? I feel like it only serves to hinder real artists who could use the money that people are paying for these services and models. Maybe I'm too poor or short-sighted to see it.

I would rather an actual animator create something beautiful for me rather than an AI spit out something that needs to be worked on by an actual animator ANYWAY.

chankstein38
4 replies
2d3h

You're clearly not the target audience for this then. That's usually my assumption when I can't figure out a use case for some research a bunch of people are excited about.

sofixa
2 replies
2d3h

Crypto generally and NFTs in particular are good indicators that things can get people excited and have no substance. Even scams and Ponzi schemes have "target audiences", but that doesn't make them useful or good.

chankstein38
1 replies
2d3h

Right, but this is generating decent-quality video in segments longer than your average movie shot. I'm sure it'd take some fiddling, but I'm excited for a model this good to come out so I can try some fancy multi-shot videos.

I saw someone else say "I'm sure it'll be crap like all of the other AI stuff I've seen" but that's a naive view. Things that have been 100% created by AI, sure they're kind of boring a lot of the time. But this kind of tech gives people with a creative mind, but no money or time or resources to create a storytelling movie/video, the resources to do it. Obv ignoring the fact that Goog will never release this, if something like this did come out, it'd be game changing for a lot of people.

Think about something like RPG Maker. Yeah, we've had a ton of random garbage come out of that platform, but there have also been incredible games.

AI isn't just some garbage maker. It is a paint brush that enables people who are alone in their room to make something bigger than them.

evilduck
0 replies
2d3h

I've used SD to generate novel clipart with my kid for their school project to make a board game. It isn't taking away from an artist; I would never in a million years pay an artist to create throwaway art for a corner of a spray-painted cardboard box. The alternative would be nothing, or my kids scribbling in something of their own hand. But they were interested and it was available, so it went from simple and plain to "custom" and rather nice and polished looking.

FWIW, my kid also designed their own board game pieces in TinkerCAD and we 3D printed them. It's nothing special, but it's frankly astounding how far kids can go now towards creating something not just imaginative but almost professional quality with the tools at their disposal. For throwaway school projects. It may not be my kids, but I'm excited for what the next generation will be able to accomplish without massive capital requirements to fulfill their vision and create something.

StarterPro
0 replies
1d20h

I understand the use case. I'm saying from a human collateral sense, what is the point of it?

Like we build these things and show them off, without any thought to the ramifications that they could lead to. Maybe I'm catastrophizing, but all this tech lately seems very unregulated/dangerous.

sofixa
0 replies
2d3h

The same can be said for generators like Midjourney or Stable Diffusion.

The target market is people and organisations who like/want/need the speed and low cost of generated "art" and prefer not dealing with external real world artists that need to be fairly compensated and will take time to produce an art piece.

Also laws are very murky on this for the moment (naturally, since it's a very recent thing), and some consider that AI "art" can't be copyrighted. The EU is currently working on a new AI framework which will probably cover that.

gedy
0 replies
2d3h

Many of these examples are combinations of realistic objects and scenes from the real world; these aren't in need of artistic interpretation or manual re-creation or animation.

codetrotter
6 replies
2d9h

Their GitHub doesn’t have anything other than the linked page currently

https://github.com/lumiere-video

Nor did they claim it would. But I had to check anyway, and there wasn't any link I could see to the GitHub profile. So here's a link for anyone else who wants to check and doesn't want to type the URL of their profile manually from looking at the hosted website URL.

gardnr
4 replies
2d6h

A popular move with AI/ML folks: use GitHub to publish information about a thing that is not open. Then "it's on GitHub".

whywhywhywhy
2 replies
2d5h

So tired of the academia-brained ML researchers. Can't wait for the next generation of teenagers to change this space and bypass this silliness completely.

jampekka
0 replies
2d4h

Isn't this more for-profit-corporation brained ML researchers? ML from academia tends to be released open source nowadays.

Der_Einzige
0 replies
2d4h

There are few better ways to get a cushy $300K-a-year-plus job, and publishing ML research is one of those ways. The new generation will simply do more publishing.

sho_hn
0 replies
2d6h

I see this a lot as well, and I think we really ought to call it out more often. It should be clear whether a GitHub publication enables downstream use or contribution.

3abiton
0 replies
1d12h

LLMs have set a new trend unfortunately.

wiz21c
4 replies
2d7h

newb question:

Do these models actually learn a 3D representation, or do they just learn "something" that is good enough to produce a very convincing impression of 3D?

Subquestion: if they don't learn 3D, can we say that models learning a 3D representation first will lead to even better productions?

l33tman
2 replies
2d1h

It has been shown that at least still-image generators learn a 3D representation internally and use it to bootstrap their generation. If you think about it, this is the only way they can be so good at shadows and reflections, perspective, lighting, etc.

wiz21c
1 replies
1d10h

Could you point at some articles which explore that aspect? It sounds very interesting. Thanks!

l33tman
0 replies
16h55m

Please see https://arxiv.org/abs/2306.05720

Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model

"... In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction. These representations appear surprisingly early in the denoising process−well before a human can easily make sense of the noisy images. ..."

astrange
0 replies
2d6h

Do these models actually learn a 3D representation or do they just learn "something" that is good enough to produce a very convincing impression of 3D ?

The second, but at the limit it's the same thing of course.

Subquestion: if they don't learn 3D, can we say that models learning a 3D representation first will lead to even better productions ?

Generally speaking manual feature engineering almost always turns out to be a waste of time if you can just make the model bigger; this is called "the bitter lesson".

throwuwu
4 replies
2d6h

We will see the first feature-length AI-generated movie this year. If you think I'm crazy, then consider that even way back at the dawn of cinema the average shot length was 12 seconds, and today it is only 2.5 seconds.

There are a few important techniques to be refined, such as keeping subjects consistent between generations, but I could see many inconsistencies being made up for by applying existing methods: separating layers based on depth, allowing more static images to be used, or creating simple 3D models with textures where more depth is needed. With enough effort and skill, someone could probably do it with existing technologies.

__loam
1 replies
2d5h

It will probably be utter dogshit like every other piece of media people are pumping out with this crap.

throwuwu
0 replies
1d23h

90 percent of everything is crap, but I've seen plenty of creative people make compelling films with digital tools. This technology puts that capability within reach of people who aren't also 3D modellers or graphic artists, so we're bound to get more output, good and bad. Same deal as when film cameras became cheap and widely available, or digital cameras, or iPhones.

seydor
0 replies
2d4h

why would we make a "movie" instead of one storyline where viewers can customize the costumes at will?

felipeerias
0 replies
2d2h

It's easy to imagine a filmmaker creating multiple draft versions of a movie to polish the script and the cinematography, similar to how they use storyboards now.

pmontra
4 replies
2d4h

Hover over the video to see the input prompt

That doesn't work on a phone. I hoped they had added an event handler for touching the animations. Instead they forgot they have a mobile OS and that they sell phones.

pmontra
0 replies
1d22h

OP here: my bad. I hadn't enabled enough JS sites in NoScript. It works now by touching the images. Thanks to everybody who replied to me.

johnnymellor
0 replies
2d3h

At least on Chrome for Android, you can long-press to trigger the hover effect. Works on many websites. (There are inconvenient side-effects like selecting text, but it's better than nothing.)

itishappy
0 replies
2d1h

Did they? Works fine on my Pixel.

7734128
0 replies
2d1h

Worked in Kiwi, which is a Chrome derivative.

richrichardsson
3 replies
2d5h

The video inpainting is interesting. My kids were watching old SpongeBob episodes recently and the 4:3 aspect ratio was jarring to me. I thought it would be an interesting use case to in-paint the side borders to bring it back into 16:9 aspect, but I suppose it would need some careful fine-tuning with some kind of look-ahead for objects that enter the frame from the sides.
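
For what it's worth, the geometry half of that is trivial; it's the fill that needs a video model. A rough sketch of the setup, with a placeholder frame and the mask a model would be asked to fill (the actual inpainting call is hypothetical and not shown):

    import numpy as np

    def pillarbox(frame_4x3: np.ndarray):
        """Pad a 4:3 frame into a 16:9 canvas and mark the side borders to be filled."""
        h, w, c = frame_4x3.shape            # e.g. 480 x 640
        target_w = int(round(h * 16 / 9))    # 16:9 width for the same height
        pad = (target_w - w) // 2
        canvas = np.zeros((h, target_w, c), dtype=frame_4x3.dtype)
        canvas[:, pad:pad + w] = frame_4x3
        mask = np.ones((h, target_w), dtype=bool)
        mask[:, pad:pad + w] = False         # True where the model should invent content
        return canvas, mask

    frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a decoded frame
    canvas, mask = pillarbox(frame)
    print(canvas.shape, int(mask.sum()))  # (480, 853, 3), ~100k border pixels per frame

The look-ahead you mention would then be about conditioning the fill on neighbouring frames (so an object walking in from the side already exists in the invented border) rather than filling each frame independently.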

araes
2 replies
2d4h

That actually sounds like a product somebody in the television and movie industry might buy.

Dynamic adjustment of fixed aspect ratio film imagery to non-native sizes without stretch or obvious distortion. Guess all the added edges accurately enough that audiences won't notice.

4:3 <-> 16:9 <-> 143:100 (IMAX) <-> 11:8 (Academy) <-> 3:2 (35mm) <-> 16:10 (tablets/desktops)

Make a new movie look like a classic b/w silent, then give it the correct frame.

Adapt any movie to smoothly work on IMAX displays.

berniedurfee
1 replies
1d19h

Or make those darn newfangled vertical TikTok-style videos watchable!

theWreckluse
0 replies
1d15h

I recently saw someone do the opposite: uncrop Bollywood movies into vertical format.

ativzzz
3 replies
2d1h

Me, watching the video and looking at samples, excitement level high

Me, scanning for a download link or a prompt to run the model and not finding any, excitement level medium

Me, realizing it's by google, excitement level zero

baldgeek
1 replies
2d1h

Here is the dataset they used: https://paperswithcode.com/dataset/ucf101

gs17
0 replies
23h43m

The paper seems to say UCF101 was used for evaluation, not training.

zitterbewegung
0 replies
2d1h

Don’t worry, OpenAI will copy it and put it in ChatGPT.

alkonaut
2 replies
2d3h

If "translator" was the victim of LLMs and "stock photographer" of diffusion models, which job is the first to be threatened by diffusion models for moving pictures? OnlyFans streamers?

rwmj
0 replies
2d1h

The people involved in producing TV adverts.

caballeto
0 replies
1d19h

Video editors, special effects artists, influencers or content creators who heavily rely on video content. @ChatGPT

Peritract
2 replies
1d22h

"We made Girl with a Pearl Earring smile and wink" demonstrates the fundamental failure of this (and similar) technology: it's the promise of generating art, made by people who really don't understand what art is.

adrenvi
1 replies
1d22h

The same was probably said about photography, moving film, film with sound, computer graphics, etc.

Peritract
0 replies
1d20h

No, it wasn't; absolutely no one ever thought the issue with films with sound was that their creators fundamentally misunderstood Girl with a Pearl Earring. Some people thought that [new medium] wasn't art; they didn't think it was driven by and for people who didn't understand any art.

I do enjoy the irony though of you copy-and-pasting a generic pro-AI rebuttal to a comment you didn't understand.

thih9
1 replies
2d2h

Looks like they're frequently mixing old images with a modern dataset; if I took a portrait of George Washington and prompted for "a man smiling", would I see dentures[1] or pearly whites?

[1] https://en.wikipedia.org/wiki/George_Washington%27s_teeth

mattnewton
0 replies
2d2h

I think you'd have to provide that out-of-distribution data in the prompt, of course. It's not clear these models have built large world models of facts like some of the larger LLMs need to; they are figuring out how things move. Most of the time, people in the dataset have pearly whites to show, and there are no videos of Washington's mouth, so I would expect that to be the default unless prompted with a detailed description of the dentures you are looking for.

feverzsj
1 replies
2d7h

As realistic as my blurry dream.

yard2010
0 replies
2d7h

Give it a break; until 2 years ago you wouldn't even dream about a fraction of what's out already.

Everything is relative!

araes
1 replies
2d4h

Pixel themed post for pixel themed paper.

It's rather impressive and quite quickly will likely result in a huge horde of "make a movie with a paragraph" programs.

It's Google, so it will probably go in a box and be a Rick and Morty gadget we never see.

It has a cool author list format I like. The 1,2,3,4,*,+ scheme is nice for lead authors, institute attribution, and core contributors. I read so many astronomy and physics papers that are 10+ authors long, and I have no idea who did anything. The arXiv listing, for example, shows no similar formatting.

It will probably be immediately used for abusive porn. Walking Woman Example: (5th variation) "Wearing no clothing"

whamlastxmas
0 replies
1d19h

This didn't occur to me, but yeah, abusive porn is about to be rampant with this sort of tech. Every single person in the world is soon to have graphic, realistic-looking pornography with their face on it.

vessenes
0 replies
1d23h

Some comments: Google, so we'll probably never get to use this directly.

That said, the idea is very interesting -- train the model to generate a small full-time representation of the video, then upscale on both time and pixels.

Essentially, we have seen models adding depth maps. This one adds a 'time map' as another dimension.

Coherence is pretty good, to my eye. The jankiness seems to be more about the model deciding what something should 'do' over time, where a lot of models struggle to keep coherence frame by frame. The big insight from the Googlers is that you could condition / train / generate on coherence as its own thing, then fill in the frames.
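
To make that concrete, here's a toy sketch of the coarse-to-fine pipeline shape being described; the three functions are hypothetical stand-ins (random noise and nearest-neighbour upsampling), not Lumiere's actual networks:

    import numpy as np

    def generate_base_clip(prompt: str, frames=16, size=(128, 128)) -> np.ndarray:
        """Base model: a low-res, low-frame-count clip covering the whole duration."""
        return np.random.default_rng(0).random((frames, *size, 3))  # placeholder sampler

    def upsample_time(clip: np.ndarray, factor=5) -> np.ndarray:
        """Temporal upsampling: fill in frames between the coarse 'time map'."""
        return np.repeat(clip, factor, axis=0)  # placeholder for learned interpolation

    def upsample_space(clip: np.ndarray, factor=8) -> np.ndarray:
        """Spatial super-resolution applied to every frame."""
        return clip.repeat(factor, axis=1).repeat(factor, axis=2)

    base = generate_base_clip("a bear dancing in times square")
    video = upsample_space(upsample_time(base))
    print(video.shape)  # (80, 1024, 1024, 3)

The point is just the shape of the pipeline: decide what happens across the whole clip first, then fill in frames and pixels.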

I think this is likely copyable by any number of the model providers out there; nothing jumps out as not implementable by Stability, for instance.

sorenjan
0 replies
2d4h

This is very impressive, but their approach of generating the whole temporal duration at once limits it to short clips. I guess one of the next steps is to make overlapping "clips" that then become longer videos.
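
A toy sketch of what that stitching could look like, cross-fading the overlapping frames; generate_clip is a random-noise stand-in for the actual short-clip model:

    import numpy as np

    def generate_clip(prompt: str, frames=80, size=(64, 64), seed=0) -> np.ndarray:
        return np.random.default_rng(seed).random((frames, *size, 3))  # placeholder

    def stitch(clips, overlap=16):
        """Cross-fade consecutive clips over their overlapping frames."""
        out = clips[0]
        ramp = np.linspace(0.0, 1.0, overlap)[:, None, None, None]  # per-frame blend weight
        for clip in clips[1:]:
            blended = out[-overlap:] * (1 - ramp) + clip[:overlap] * ramp
            out = np.concatenate([out[:-overlap], blended, clip[overlap:]], axis=0)
        return out

    clips = [generate_clip("a surfer at sunset", seed=s) for s in range(3)]
    video = stitch(clips)
    print(video.shape[0])  # 3 * 80 - 2 * 16 = 208 frames

A real system would also have to condition each window on the previous one so the content, not just the pixels, stays consistent across the seam.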

sleight42
0 replies
1d18h

How long before generating deepfakes is trivial? This seems many steps closer.

sho_hn
0 replies
2d6h

With the weird creepy dream-like nature of these little AI video gen samples, I'm perpetually disappointed that none of these papers ever include a "dreaming of electric sheep" prompt as an easter egg.

seydor
0 replies
2d4h

I wonder who's going to make a model that creates and textures a 3D world with AI. It's going to be a necessity for VR goggles to find some non-gimmicky use cases.

rysertio
0 replies
2d6h

The amount of computing resources it's going to take to retrain the model is enormous. So most of us will have to wait for a big company to publish or leak their weights before we get to use anything written in the paper.

pylua
0 replies
2d4h

This is absolutely unbelievable. Truly impressive.

I felt like this was maybe 5-10 years away.

mdrzn
0 replies
2d5h

DAMN! Take this announcement back just 2-3 years and it would have been MIND BLOWING.

I know we're all used to new releases like this coming very soon and very fast, but I'm amazed. I can't wait to have software with these abilities. Edit: nvm, it's by Google. I'll wait for an open-source version to be released.

max_
0 replies
2d

Why won't Google publish a product that does this?

macawfish
0 replies
2d2h

This is remarkable, it would have been unthinkable 5 years ago.

interestica
0 replies
1d21h

Video Inpainting: 4:3 --> 16:9 conversions

ilaksh
0 replies
2d4h

Congratulations to the researchers. It would be nice if it wasn't Google, though, because we'll probably have to wait 3-6 months for it to show up in their Vertex API. For special customers only.

harha_
0 replies
2d1h

This pace of progress almost scares me.

abkolan
0 replies
2d2h

How soon can Google _Productize_ it?

RcouF1uZ4gsC
0 replies
2d3h

Sorry, I discount all AI text/image/video generation that doesn't actually have a demo site where I can put in prompts and see what is being generated.

It is so easy to game and tweak examples, especially since there is a random component to them. For example, you could do a prompt 1 million times and only show the best response. Or you could use prompts that it’s optimized for.

The reason ChatGPT and Dall-e captured the public’s imagination is that the public could actually put in their prompts and see the results.

Aerbil313
0 replies
2d1h

Eh. I knew this day would come. Video is no evidence of anything now.

88j88
0 replies
2d4h

Is this all real, or faked à la Gemini?