
VideoGigaGAN: Towards detail-rich video super-resolution

metalrain
54 replies
3d3h

Video quality seems really good, but limitations are quite restrictive "Our model encounters challenges when processing extremely long videos (e.g. 200 frames or more)".

I'd say most videos in practice are longer than 200 frames, so a lot more research is still needed.

anvuong
31 replies
3d2h

At 24fps that's not even 10 seconds. Calling it extremely long is kinda defensive.

bufferoverflow
27 replies
3d1h

The average shot length in a modern movie is around 2.5 seconds (down from 12 seconds in the 1930s).

For animations it's around 15 seconds.

mateo1
20 replies
3d

Huh, I thought this couldn't be true, but it is. The first time I noticed annoyingly fast cuts was World War Z, for me it was unwatchable with tons of shots around 1 second each.

jonplackett
14 replies
3d

So sad they didn't keep to the idea of the book. Anyone who hasn't read this book should; it bears no resemblance to the movie aside from the name.

danudey
6 replies
3d

It's off-topic, but this is very good advice. As near as I can tell, there aren't any real similarities between the book and the movie; they're two separate zombie stories with the same name, and honestly I would recommend them both for wildly different reasons.

robertlagrant
2 replies
2d19h

there aren't any real similarities between the book and the movie; they're two separate zombie stories with the same name

Funny - this is also a good description of I Am Legend.

btown
1 replies
2d13h

And similarly, I, Robot, which is much more enjoyable when you realize it started as an independent murder-mystery screenplay that had Asimov’s works shoehorned in when both rights were bought in quick succession. I love both the movie and the collection of short stories, for vastly different reasons.

https://www.cbr.com/i-robot-original-screenplay-isaac-asimov...

robertlagrant
0 replies
2d9h

Will Smith, a strange commonality in this tiny subgenre.

jonplackett
2 replies
3d

I didn’t rate the film really, but loved the book. Apparently it is based on / taking style inspiration from real first hand accounts of ww2.

KineticLensman
1 replies
2d20h

It’s style is based on the oral history approach used by Studs Terkel to document aspects of WW2 - building a big picture by interleaving lots of individual interviews.

jonplackett
0 replies
2d19h

Making the movie or a documentary series like that would have been awesome.

scns
5 replies
2d22h

I know two movies where the book is way better: Jurassic Park and Fight Club. I thought about putting spoilers in a comment to this one, but I won't.

jonplackett
1 replies
2d19h

The Lost World is also a great book. It explores a lot of interesting stuff the film completely ignores, like the fact that the raptors are only rampaging monsters because they had no proper upbringing, having been born in the lab with no mama or papa raptor to teach them social skills.

gattr
0 replies
2d10h

But hey, at least we finally got the motorcycle chase (kind of) in "Jurassic World"! (It's my favourite entry in the series, BTW.)

hobs
1 replies
2d16h

Disagree, Jurassic Park was an amazing movie on multiple levels, the book was just differently good, and adapting it to film in the exact format would have been less interesting (though the ending was better in the book.)

jonplackett
0 replies
2d9h

I totally forgot the book ending! So much better.

I think, like the motorcycle chase they borrowed from The Lost World for Jurassic World, they also have a scene with those tiny dinosaurs pecking someone to death.

mrbombastic
0 replies
2d19h

Also The Godfather. No Country for Old Men I wouldn't say is better, but it is fantastic.

sizzle
0 replies
2d22h

Loved the audiobook

philipov
1 replies
2d22h

Batman Begins was already, in 2005, basically just a feature-length trailer - all the pacing was completely cut out.

epolanski
0 replies
2d18h

Yes, Nolan improves on that in later movies, but he used to overdo it.

Another movie of his that does this non-stop is The Prestige.

lelandfe
1 replies
3d

Yeah, the average may also be getting driven (e: down) by the basketball scene in Catwoman

p1mrx
0 replies
3d

[watches scene] I think you mean the average shot length is driven down.

throwup238
2 replies
3d1h

The textures of objects need to maintain consistency across much larger time frames, especially at 4k where you can see the pores on someone's face in a closeup.

vasco
0 replies
3d

I'm sure that if you really want to burn money on compute, you can do some smart windowing in the processing, use it on overlapping chunks, and do an OK job.

bookofjoe
0 replies
3d

Off topic: the clarity of pores and fine facial hair on Vision Pro when watching on a virtual 120-foot screen is mindblowing.

kyriakos
0 replies
3d

People won't be upscaling modern movies though.

danudey
0 replies
3d

Sure, but that represents a lot of fast cuts balanced out by a selection of significantly longer cuts.

Also, it's less likely that you'd want to upscale a modern movie, which is more likely to be higher resolution already, as opposed to an older movie which was recorded on older media or encoded in a lower-resolution format.

arghwhat
0 replies
2d21h

I believe the relevant data point when considering applicability is the median shot length to give an idea of the length of the majority of shots, not the average.

It reminds me of the story about the Air Force making cockpits to fit the elusive average pilot, which in reality fit none of their pilots...
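
As a toy illustration of the median-versus-mean point (the shot lengths below are invented, not measured from any film):

    # A handful of long establishing shots can drag the mean up while the
    # typical shot stays short, which is why the median is more informative here.
    shot_lengths = [2, 2, 3, 2, 2, 3, 2, 45, 2, 3, 2, 60]  # seconds, hypothetical

    mean = sum(shot_lengths) / len(shot_lengths)
    median = sorted(shot_lengths)[len(shot_lengths) // 2]

    print(f"mean   = {mean:.1f}s")  # ~10.7s, dominated by the two long shots
    print(f"median = {median}s")    # 2s, closer to what most shots look like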

lupusreal
2 replies
3d2h

10 seconds is what, about a dozen cuts in a modern movie? Much longer has people pulling out their phones.

jsheard
0 replies
3d2h

:( "Our model encounters challenges when processing >200 frame videos"

:) "Our model is proven production-ready using real-world footage from Taken 3"

https://www.youtube.com/watch?v=gCKhktcbfQM

boogieknite
0 replies
2d22h

Freal. To the degree that I compulsively count seconds on shots until a show/movie has a few shots over 9 seconds; then they "earn my trust" and I can let it go. I'm fine.

chompychop
10 replies
3d3h

I guess one can break videos into 200-frame chunks and process them independently of each other.

whywhywhywhy
6 replies
3d2h

Not if there isn't coherency between those chunks

anigbrowl
5 replies
2d23h

Easily solved, just overlap by ~40 frames and fade the upscaled last frames of chunk A into the start of chunk B before processing. Editors do tricks like this all the time.
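
As a rough sketch of this overlap-and-crossfade idea (my illustration, not code from the paper; upscale_chunk() is a placeholder for whatever super-resolution model you run, and the chunk/overlap sizes are arbitrary):

    def upscale_video(frames, chunk_size=200, overlap=40):
        """frames: a list of HxWx3 float arrays (e.g. NumPy). Upscale in overlapping
        chunks and crossfade the overlap so generated detail morphs gradually
        instead of popping at chunk boundaries."""
        out, start, prev_tail = [], 0, None
        while True:
            up = upscale_chunk(frames[start:start + chunk_size])  # hypothetical model call
            if prev_tail is not None:
                for i in range(overlap):           # linear crossfade over the overlap region
                    w = (i + 1) / (overlap + 1)
                    up[i] = (1 - w) * prev_tail[i] + w * up[i]
            if start + chunk_size >= len(frames):  # last chunk: emit everything that's left
                out.extend(up)
                return out
            out.extend(up[:-overlap])              # hold the tail back for blending
            prev_tail = up[-overlap:]
            start += chunk_size - overlap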

y04nn
1 replies
2d20h

And now you end up with 40 blurred frames for each transition.

anigbrowl
0 replies
2d17h

'before processing'

readyman
1 replies
2d19h

Decent editors may try that once, but they will give up right away because it will only work by coincidence.

epolanski
0 replies
2d18h

There has to be a way to do it intelligently in chunks and reduce noise along the chunk borders.

Moreover, I imagine that further research and computing power will make this a lot smarter and quicker.

Don't forget people had Toy Story-comparable games within a decade or so of it originally being rendered at 1536x922.

drra
0 replies
2d10h

Or upscale every 4th frame for consistency. Upscaling in between frames should be much easier.

prmoustache
2 replies
3d3h

At 30fps, which is not high, that would mean chunks of less than 7 seconds. Doable but highly impractical to say the least.

IanCal
1 replies
3d2h

7s is pretty alright; I've seen HLS chunks of 6 seconds, which is pretty common I think.

srveale
0 replies
3d

Our invention works best except for extremely long flight times of 13 seconds

KeplerBoy
1 replies
3d2h

Fascinating how researchers put out amazing work and then claim that videos consisting of more than 200 frames are "extremely long".

Would it kill them to say that the method works best on short videos/scenes?

jsheard
0 replies
3d2h

Tale as old as time, in graphics papers it's "our technique achieves realtime speeds" and then 8 pages down they clarify that they mean 30fps at 640x480 on an RTX 4090.

madduci
0 replies
3d

I think it encounters memory leaks and memory usage goes through the roof.

kyriakos
0 replies
2d22h

If I'm understanding the limitations section of the paper correctly, the 200-frame limit depends on the scene; it may be worse or better.

kazinator
0 replies
3d

Break into chunks that overlap by, say, a second, upscale separately and then blend to reduce sudden transitions in the generated details to gradual morphing.

The details changing every ten seconds or so is actually a good thing; the viewer is reminded that what they are seeing is not real, yet still enjoying a high resolution video full of high frequency content that their eyes crave.

geysersam
0 replies
2d23h

Wonder what happens if you run it piece-wise on every 200 frames. Perhaps it glitches at the interfaces between chunks.

cryptonector
0 replies
3d

It's good enough for "enhance, enhance, enhance" situations.

babypuncher
0 replies
3d

Well there goes my dreams of making my own Deep Space Nine remaster from DVDs.

anigbrowl
0 replies
2d23h

If you're using this for existing material you just cut into <=8 second chunks, no big deal. Could be an absolute boon for filmmakers, otoh a nightmare for privacy because this will be applied to surveillance footage.

scoobertdoobert
28 replies
3d1h

Is anyone else concerned at the societal effects of technology like this? In one of the examples they show a young girl. In the upscale example it's quite clearly hallucinating makeup and lipstick. I'm quite worried about tools like this perpetuating social norms even further.

roughly
14 replies
3d1h

Yes, but if you mention that here, you’ll get accused of wokeism.

More seriously, though, yes, the thing you’re describing is exactly what the AI safety field is attempting to address.

Culonavirus
12 replies
3d1h

is exactly what the AI safety field is attempting to address

Is it though? I think it's pretty obvious to any neutral observer that this is not the case, at least judging based on recent examples (leading with the Gemini debacle).

fwip
6 replies
3d

Yes, avoiding creating societally-harmful content is what the Gemini "debacle" was attempting to do. It clearly had unintended effects (e.g: generating a black Thomas Jefferson), but when these became apparent, they apologized and tried to put up guard rails to keep those negative effects from happening.

Culonavirus
3 replies
3d

societally-harmful content

Who decides what is "societally-harmful content"? Isn't literally rewriting history "societally-harmful"? The black T.J. was a fun meme, but that's not what the alignment's "unintended effects" were limited to. I'd also say that if your LLM condemns right-wing mass murderers, but "it's complicated" with the left-wing mass murderers (I'm not going to list a dozen of other examples here, these things are documented and easy to find online if you care), there's something wrong with your LLM. Genocide is genocide.

HeatrayEnjoyer
1 replies
2d23h

This isn't the un-determinable question you've framed it as. Society defines what is and isn't acceptable all the time.

Who decides what is "societally-harmful theft"? Who decides what is "societally-harmful medical malpractice"? Who decides what is "societally-harmful libel"?

The people who care to make the world a better place and push back against those that cause harm. Generally a mix of de facto industry standard practices set by societal values and pressures, and de jure laws established through democratic voting, legislature enactment, and court decisions.

"What is "societally-harmful driving behavior"" was once a broad and undetermined question but nevertheless it received an extensive and highly defined answer.

N0b8ez
0 replies
2d22h

The people who care to make the world a better place and push back against those that cause harm.

This is circular. It's fine to just say "I don't know" or "I don't have a good answer", but pretending otherwise is deceptive.

fwip
0 replies
2d2h

Who decides what is "societally-harmful content"?

Are you stupid, or just pretending to be?

llm_nerd
1 replies
2d23h

What Gemini was doing -- what it was explicitly forced to do by poorly considered dogma -- was societally harmful. It is utterly impossible that these were "unintended"[1], and were revealed by even the most basic usage. They aren't putting guardrails to prevent it from happening, they quite literally removed instructions that explicitly forced the model to do certain bizarre things (like white erasure, or white quota-ing).

[1] - Are people seriously still trying to argue that it was some sort of weird artifact? It was blatantly overt and explicit, and absolutely embarrassing. Hopefully Google has removed everyone involved with that from having any influence on anything for perpetuity as they demonstrate profoundly poor judgment and a broken sense of what good is.

fwip
0 replies
2d2h

I didn't say the outcome wasn't harmful. I said that the intent of the people who put it in place was to reduce harm, which is obvious.

roughly
4 replies
3d

Yeah, I don’t think there’s such thing as a “neutral observer” on this.

Culonavirus
3 replies
3d

An LLM should represent a reasonable middle of the political bell curve where Antifa is on the far left and Alt-Right is on the far right. That is what I meant by a neutral observer. Any kind of political violence should be considered deplorable, which was not the case with some of the Gemini answers. Though I do concede that right wingers cooked up questionable prompts and were fishing for a story.

roughly
0 replies
2d23h

All of this is political. It always is. Where does the LLM fall on trans rights? Where does it fall on income inequality? Where does it fall on tax policy? "Any kind of political violence should be considered deplorable" - where's this fall on Israel/Gaza (or Hamas/Israel)? Does that question seem non-political to you? 50 years ago, the middle of American politics considered homosexuality a mental disorder - was that neutral? Right now if you ask it to show you a Christian, what is it going to show you? What _should_ it show you? Right now, the LLM is taking a whole bunch of content from across society, which is why it turns back a white man when you ask it for a doctor - is that neutral? It's putting lipstick on an 8-year-old, is that neutral? Is a "political bell curve" with "antifa on the left" and "alt-right on the right" neutral in Norway? In Brazil? In Russia?

Intralexical
0 replies
2d23h

Speaking as somebody from outside the United States, please keep the middle of your political bell curve away from us.

HeatrayEnjoyer
0 replies
2d23h

An LLM should represent a reasonable middle of the political bell curve where Antifa is on the far left and Alt-Right is on the far right. That is what I meant by a neutral observer.

This is a bad idea.

Equating extremist views with those seeking to defend human rights blurs the ethical reality of the situation. Adopting a centrist position without critical thought obscures the truth since not all viewpoints are equally valid or deserve equal consideration.

We must critically evaluate the merits of each position (anti-fascists and fascists are very different positions indeed) rather than blindly placing them on equal footing, especially as history has shown the consequences of false equivalence perpetuate injustice.

lupusreal
0 replies
3d

Nobody mentioned wokism except you.

mrandish
4 replies
2d18h

No, I'm not concerned. When an AI is trained on a largely raw, uncurated set of low-quality data (eg most of the public internet), it's going to miss subtle distinctions some humans might prefer that it make. I'm confident that pretty quickly the majority of the general public using such AIs will begin to intuitively understand this. Just as they have developed a practical, working understanding of other complex technology's limitations (such as auto-complete algorithms). No matter how good AI gets, there will always be some frontier boundary where it gets something wrong. My evidence is simply that even smart humans trying their best occasionally get such subtle distinctions wrong. However, this innate limitation doesn't mean that an AI can't still be useful.

What I am concerned about is that AI providers will keep wasting time and resources trying to implement band-aid "patches" to address what is actually an innate limitation. For example, exception processing at the output stage fails in ways we've already seen, such as AI photos containing female popes or an AI lying to deny that HP Lovecraft had a childhood pet (due to said pet having a name that was crudely rude 100 years ago but racist today). The alternative of limiting the training data to include only curated content fails by yielding a much less useful AI.

My, probably unpopular, opinion is that when AI inevitably screws up some edge case, we get more comfortable saying, basically, "Hey, sometimes stupid AI is gonna be stupid." The honest approach is to tell users upfront: when quality or correctness or fitness for any given purpose is important, you need to check every AI output because sometimes it's gonna fail. Just like auto-pilots, auto-correct and auto-everything else. As impressive as AI can sometimes be, personally, I think it's still lingering just below the threshold of "broadly useful" and, lately, the rate of fundamental improvement is slowing. We can't really afford to be squandering limited development resources or otherwise nerfing AI's capabilities to pursue ultimately unattainable standards. That's a losing game because there's a growing cottage industry of concern trolls figuring out how to get an AI to generate "problematic" output to garner those sweet "tsk tsk" clicks. As long as we keep reflexively reacting, those goalposts will never stop moving. Instead, we need to get off that treadmill and lower user expectations based on the reality of the current technology and data sets.

erhaetherth
2 replies
2d17h

AI lying to deny that HP Lovecraft had a childhood pet

GPT4 told me with no hesitation.

pezezin
0 replies
2d13h

I just tested it on Copilot. It starts responding and then at some point deletes the whole text and replies with:

"Hmm… let’s try a different topic. Sorry about that. What else is on your mind?"

mrandish
0 replies
2d16h

Ah, interesting. Originally, it would answer that question correctly. Then it got concern trolled in a major media outlet and some engineers were assigned to "patch it" (ie make it lie). Then that lie got highlighted some places (including here on HN), so I assume since then some more engineers got assigned to unpatch the patch.

I'll take that as supporting my point about the folly of wasting engineering time chasing moving goalposts. :-)

eigenvekt
0 replies
2d8h

I am not at all.

We seem to have a culture of completely paranoid people now.

When the internet came along, every conversation was not dominated by "but what about people knowing how to build bombs???" the way most AI conversations flip to these paranoid AI doomer scenarios.

arketyp
2 replies
3d1h

I don't know, it's a mirror, right? It's up to us to change really. Besides, failures like the one you point out make subtle stereotypes and biases more conspicuous, which could be a good thing.

ixtli
0 replies
3d

Precisely: tools don't have morality. We have to engage in political and social struggle to make our conditions better. These tools can help, but they certainly won't do it for us, nor will they be the reason why things go bad.

ajmurmann
0 replies
3d

It's interesting that the output of the genAI will inevitably get fed into itself. Both directly and indirectly by influencing humans who generate content that goes back into the machine. How long will the feedback loop take to output content reflecting new trends? How much new content is needed to be reflected in the output in a meaningful way. Can more recent content be weighted more heavily? Such interesting stuff!

unshavedyak
0 replies
3d1h

Aside from your point: it does look like she is wearing lipstick to me, tho. More likely lip balm. Her (unaltered) lips have specular highlights on the tops that suggest they're wet or have lip balm. As for the makeup, not sure there. Her cheeks seem rosy in the original, and I'm not sure what you're referring to beyond that. Perhaps her skin is too clear in the AI version, suggesting some type of foundation?

I know nothing of makeup tho, just describing my observations.

the_duke
0 replies
2d23h

I don't think it's hallucinating too much.

The nails have nail polish in the original, and the lips also look like they have at least lip gloss or a somewhat more muted lipstick.

satvikpendem
0 replies
2d22h

From Plato's dialogue Phaedrus 14, 274c-275b:

Socrates: I heard, then, that at Naucratis, in Egypt, was one of the ancient gods of that country, the one whose sacred bird is called the ibis, and the name of the god himself was Theuth. He it was who invented numbers and arithmetic and geometry and astronomy, also draughts and dice, and, most important of all, letters.

Now the king of all Egypt at that time was the god Thamus, who lived in the great city of the upper region, which the Greeks call the Egyptian Thebes, and they call the god himself Ammon. To him came Theuth to show his inventions, saying that they ought to be imparted to the other Egyptians. But Thamus asked what use there was in each, and as Theuth enumerated their uses, expressed praise or blame, according as he approved or disapproved.

"The story goes that Thamus said many things to Theuth in praise or blame of the various arts, which it would take too long to repeat; but when they came to the letters, "This invention, O king," said Theuth, "will make the Egyptians wiser and will improve their memories; for it is an elixir of memory and wisdom that I have discovered." But Thamus replied, "Most ingenious Theuth, one man has the ability to beget arts, but the ability to judge of their usefulness or harmfulness to their users belongs to another; and now you, who are the father of letters, have been led by your affection to ascribe to them a power the opposite of that which they really possess.

"For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom, for they will read many things without instruction and will therefore seem to know many things, when they are for the most part ignorant and hard to get along with, since they are not wise, but only appear wise."

bbstats
0 replies
2d23h

looks pretty clearly like she has makeup/lipstick on in the un-processed video to me.

MrNeon
0 replies
2d23h

Seems to be stock footage, is it surprising makeup would be involved?

constantcrying
14 replies
3d3h

The first demo on the page alone shows that it is a huge failure. It clearly changes the expression of the person.

Yes, it is impressive, but it's not what you want to actually "enhance" a movie.

turnsout
11 replies
3d3h

I agree that it's not perfect, though it does appear to be SoTA. Eventually something like this will just be part of every video codec. You stream a 480p version and let the TV create the 4K detail.

constantcrying
7 replies
3d3h

Why would you ever do that?

If you have the high res data you can actually compress the details which are there and then recreate them. No need to have those be recreated, when you actually have them.

Downscaling the images and then upscaling them is pure insanity when the high res images are available.

turnsout
6 replies
3d3h

So streaming services can save money on bandwidth

constantcrying
5 replies
3d2h

That's absurd. I think anybody is aware that it is far superior to compress in the frequency domain (e.g. JPEG) than to downsample your image. If you don't believe me, just compare a JPEG-compressed image with the same image compressed to the same file size by downsampling. You will notice a literal night-and-day difference.

Downsampling is a bad way to do compression. It makes no sense to do NN reconstruction on a downsampled image when you could have compressed the image better and reconstructed from that data.
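
A quick way to check this claim yourself (a sketch, assuming Pillow and NumPy; 'photo.png' and the quality/scale factors are placeholder choices that would need tuning so the two byte counts roughly match):

    import io
    import numpy as np
    from PIL import Image

    def psnr(a, b):
        a = np.asarray(a, dtype=np.float64)
        b = np.asarray(b, dtype=np.float64)
        return 10 * np.log10(255.0 ** 2 / np.mean((a - b) ** 2))

    original = Image.open("photo.png").convert("RGB")   # hypothetical test image

    # Option 1: keep full resolution, compress in the frequency domain (JPEG).
    buf = io.BytesIO()
    original.save(buf, format="JPEG", quality=30)
    jpeg_bytes = buf.tell()
    buf.seek(0)
    jpeg_roundtrip = Image.open(buf).convert("RGB")

    # Option 2: downsample 4x, store losslessly, then upsample back.
    small = original.resize((original.width // 4, original.height // 4), Image.LANCZOS)
    buf2 = io.BytesIO()
    small.save(buf2, format="PNG")
    down_bytes = buf2.tell()
    down_roundtrip = small.resize(original.size, Image.LANCZOS)

    print(f"JPEG:       {jpeg_bytes} bytes, PSNR {psnr(original, jpeg_roundtrip):.1f} dB")
    print(f"Downsample: {down_bytes} bytes, PSNR {psnr(original, down_roundtrip):.1f} dB")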

acuozzo
4 replies
3d2h

An image downscaled and then upscaled to its original size is effectively low-pass filtered where the degree of edge preservation is dictated by the kernel used in both cases.

Are you saying low-pass filtering is bad for compression?

itishappy
1 replies
2d19h

The word is "blur." Low-pass filtering is blurring.

Is blurring good for compression? I don't know what that means. If the image size (not the file size) is held constant, a blurry image and a clear image take up exactly the same amount of space in memory.

Blurring is bad for quality. Our vision is sensitive to high-frequency stuff, and low-pass filtering is by definition the indiscriminate removal of high-frequency information. Most compression schemes are smarter about the information they filter.

acuozzo
0 replies
20h47m

Is blurring good for compression? I don't know what that means.

Consider lossless RLE compression schemes. In this case, would data with low or high variance compress better?

Now consider RLE against sets of DCT coefficients. See where this is going?

In general, having lower variance in your data results in better compression.

Our vision is sensitive to high-frequency stuff

Which is exactly why we pick up HF noise so well! Post-processing houses are very often presented with the challenge of choosing just the right filter chain to maximize fidelity under size constraint(s).

low-pass filtering is by definition the indiscriminate removal of high-frequency information

It's trivial to perform edge detection and build a mask to retain the most visually-meaningful high frequency data.
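
A toy example of the RLE point (my own illustration, not from the thread):

    def rle(data):
        """Run-length encode a sequence into (value, run_length) pairs."""
        runs, prev, count = [], data[0], 1
        for x in data[1:]:
            if x == prev:
                count += 1
            else:
                runs.append((prev, count))
                prev, count = x, 1
        runs.append((prev, count))
        return runs

    smooth = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]  # low variance: long runs
    noisy  = [0, 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]  # high variance: no runs

    print(len(rle(smooth)))  # 3 runs  -> compresses well
    print(len(rle(noisy)))   # 12 runs -> no savings at all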

constantcrying
1 replies
3d2h

Do you seriously think down sampling is superior to JPEG?

acuozzo
0 replies
21h5m

No. I never made this claim. My argument is pedantic.

jsheard
0 replies
3d2h

Not really, DLAA and the current incarnation of DLSS are temporal techniques, meaning all of the detail they add is pulled from past frames. That's an approach which only really makes sense in games where you can jitter the camera to continuously generate samples at different subpixel offsets with each frame.

The OP has more in common with the defunct DLSS 1.0, which tried to infer extra detail out of thin air rather than from previous frames, without much success in practice. That was like 5 years ago though so maybe the idea is worth revisiting at some point.

itishappy
0 replies
2d18h

Your video codec should never create a 480p version at all. Downsampling is incredibly lossy. Instead stream the internal state of your network directly, effectively using the network to decompress your video. Train a new network to generate this state, acting as a compressor. This is the principle of neural compression.

This has two major benefits:

1. You cut out the low-resolution half of your network entirely. (Go check out the architecture diagram of the original post.)

2. Your encoder network now has access to the original HD video, so it can choose to encode the high-frequency details directly instead of generating them afterwards.
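
A minimal conceptual sketch of that encoder/decoder split (assuming PyTorch; the layer sizes and the 256x256 test patch are arbitrary placeholders, not the architecture of any real codec or of the paper):

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):   # runs at the sender: frame -> compact latent state
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
                nn.Conv2d(32, 8, 5, stride=4, padding=2),  # 8-channel latent, 16x smaller
            )
        def forward(self, x):
            return self.net(x)

    class Decoder(nn.Module):   # runs at the receiver: latent state -> reconstructed frame
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.ConvTranspose2d(8, 32, 4, stride=4), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=4),
            )
        def forward(self, z):
            return self.net(z)

    enc, dec = Encoder(), Decoder()
    frame = torch.rand(1, 3, 256, 256)            # one small test patch
    latent = enc(frame)                           # this is what you would transmit
    recon = dec(latent)                           # reconstruction at the receiver
    loss = nn.functional.mse_loss(recon, frame)   # trained end to end on real video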

philipov
1 replies
3d3h

It doesn't change the expression - the animated gifs are merely out of sync.

This appears to happen because they begin animating as soon as they finish loading, which happens at different times for each side of the image.

mlyle
0 replies
3d2h

Reloading can get them in sync. But, it seems to stop playback of the "left" one if you drag the slider completely left, which makes it easy to get desynced again.

herculity275
13 replies
3d2h

Wonder how long until Hollywood CGI shops have these types of models running as part of their post-production pipeline. Big blockbusters often release with ridiculously broken CGI due to crunch (Black Panther's third act was notorious for looking like a retro video-game), adding some extra generative polish in those cases is a no-brainer.

whywhywhywhy
6 replies
3d

Once AI tech gets fully integrated, the entire Hollywood rendering pipeline will go from rendering to diffusing.

imiric
5 replies
2d23h

Once AI tech gets fully integrated, the movie industry will cease to exist.

Version467
4 replies
2d22h

Hollywood has incredible financial and political power. And even if fully AI-generated movies reach the same quality (both visually and story-wise) as current ones, there's enough value in the shared experience of watching the same movies as other people that a complete collapse of the industry seems highly unlikely to me.

shepherdjerred
2 replies
2d20h

that a complete collapse of the industry seems highly unlikely to me.

Unlikely in the next 10 years or the next 100?

Version467
1 replies
1d23h

I wouldn’t have any confidence in any predictions I make 100 years into the future even if we didn’t have the current AI developments.

With that said, I'm pretty confident that the movie industry will exist in 10 years (maybe heavily transformed, but still existing and still pretty big). If it's still a big part of pop culture by then (vs obviously on its way out), then I'd expect a collapse of it to require a change that is not a result of AI proliferation, but something else entirely.

shepherdjerred
0 replies
1d22h

My point is that many talk about AI as though it's not going to evolve or get better. It's a mindset of "We don't need to talk about this because it won't happen tomorrow".

Realistically, AI being able to replace Hollywood is something that could happen in 20-50 years. That's within most people's lifetime.

rini17
0 replies
2d21h

What quality? Current industry movies are, for lack of a better term, inbred. Sound too loud, washed-out rigid color schemes, keeping the audience's attention captive at all costs. They already exclude a large, more sensitive part of the population that hates all of this despite the shared experience. And AI is exceptionally good at further inbreeding to the extreme.

While of course it isn't impossible for any industry to reinvent itself, and movies as an art form won't die, I have my doubts about where it's going.

j45
3 replies
3d2h

A few years if not less.

They will have huge budgets for compute and the makers of compute will be happy to absorb those budgets.

Cloud production was already growing but this will continue to accelerate it imho

lupusreal
2 replies
3d2h

Wasn't Hollywood an early adopter of advanced AI video stuff, w.r.t. de-aging old famous actors?

j45
0 replies
3d1h

Bingo. Except it looked like magic because the tech was so expensive and only available to them.

Limited access to the tech added some mystique to it too.

Just like digital cameras created a lot more average photographers, it pushed photography to a higher standard than just having access to expensive equipment.

inhumantsar
0 replies
3d1h

Yeah, and the only reason we don't see more of it is that it was prohibitively expensive for all but basically Disney.

The compute budgets for basic run-of-the-mill small-screen 3D rendering and 2D compositing are already massive compared to most other businesses of a similar scale. The industry has been underpaying its artists for decades too.

I'm willing to bet that as soon as Unreal or Adobe or whoever comes out with a Stable Diffusion-like model that can be consistent across a feature-length movie, they'll stop bothering with artists altogether.

Why have an entire team of actual people in the loop when the director can just tell the model what they want to see? Why shy away from revisions when the model can update the colour grade or edit a character model throughout the entire film without needing to re-render?

londons_explore
0 replies
3d

generative polish

I don't think we're far away from models that are able to take video input of an almost finished movie and add the finishing touches.

Eg. make the lighting better, make the cgi blend in better, hide bits of set that ought to have been out of shot, etc.

anigbrowl
0 replies
2d23h

A couple of months.

k2xl
6 replies
3d2h

Would be neat to see this on much older videos (maybe WW2 era) to see how it improves details.

bberrry
4 replies
3d2h

You mean _invents_ details.

djfdat
3 replies
3d2h

You mean _infers_ details.

ta8645
1 replies
3d2h

Or, extracts from its digital rectum?

reaperman
0 replies
3d2h

*logit

itishappy
0 replies
3d1h

What's the distinction?

loudmax
0 replies
2d23h

That is essentially what Peter Jackson did for the 2018 film They Shall Not Grow Old: https://www.imdb.com/title/tt7905466/

They used digital upsampling techniques and colorization to make World War One footage into high resolution. Jackson would later do the same process for the 2021 series Get Back, upscaling 16mm footage of the Beatles taken in 1969: https://www.imdb.com/title/tt9735318/

Both of these are really impressive. They look like they were shot on high resolution film recently, instead of fifty or a hundred years ago. It appears that what Peter Jackson and his team did meticulously at great effort can now be automated.

Everyone should understand the limitations of this process. It can't magically extract details from images that aren't there. It is guessing and inventing details that don't really exist. As long as everyone understands this, it shouldn't be a problem. Like, we don't care that the cross-stitch on someone's shirt in the background doesn't match reality so long as it's not an important detail. But if you try to go Blade Runner/CSI and extract faces from reflections of background objects, you're asking for trouble.

rowanG077
5 replies
3d2h

I am personally much more interested in frame-rate upscalers. A proper 60Hz just looks much better than anything else. I would also really, really like to see a proper 60Hz animation upscale. Anything in that space just sucks. But in the rare cases where it works, it really looks next level.

fwip
3 replies
3d

Frame-rate upscaling is fine for video, but for animation it's awful.

I think it's almost inherently so, because of the care that an artist takes in choosing keyframes, deforming the action, etc.

rowanG077
2 replies
2d17h

This just sounds like the AI is not good enough yet. I mean, it's pretty clear now that there is nothing stopping AI from producing work close to, or sometimes even exceeding, that of human artists. A big problem here is good training material.

fwip
1 replies
2d2h

Baffling to me that you think that AI art is capable of "exceeding" the human training material.

rowanG077
0 replies
2d1h

I didn't say that. I said that AI is capable of sometimes exceeding human artists. That is not the same thing as saying AI is exceeding the best human artist. If your training material is of high quality, it shouldn't be impossible to exceed human artists some or even most of the time, i.e. produce better material than the average or good artists.

whywhywhywhy
0 replies
3d

Have you tried DAIN?

jack_riminton
5 replies
3d2h

Another boon for the porn industry

esafak
3 replies
3d2h

Why, so they can restore old videos? I can't see much demand for that.

jack_riminton
0 replies
2d23h

Ok then?

falcor84
0 replies
3d1h

"I can't see much" - that's the demand

duskwuff
0 replies
2d20h

There are a lot of old porn videos out there which have become commercially worthless because they were recorded at low resolutions (e.g. 320x240 MPEG, VHS video, 8mm film, etc). Being able to upscale them to HD resolutions, at high enough quality that consumers are willing to pay for it, would be a big deal.

(It doesn't hurt that a few minor hallucinations aren't going to bother anyone.)

falcor84
0 replies
3d1h

History demonstrates that what's good for porn is generally good for society.

Aissen
5 replies
3d2h

This is great for entertainment (and hopefully that's the main application), but we need clear marking of this type of video before hallucinated details are used as "proof" of any kind by people who don't know how this works. Software video/photography on smartphones is already using proprietary algorithms that "infer" non-existent or fake details, and this would be at an even bigger scale.

staminade
1 replies
3d1h

Funny to think of all those scenes in TV and movies when someone would magically "enhance" a low-resolution image to be crystal clear. At the time, nerds scoffed, but now we know they were simply using an AI to super-scale it. In retrospect, how many fictional villains were condemned on the basis of hallucinated evidence? :-D

jsheard
0 replies
3d1h

Enemy of the State (1998) was prescient, that had a ridiculous example of "zoom and enhance" where they move the camera, but they hand-waved it as the computer "hypothesizing" what the missing information might have been. Which is more or less what gaussian splat 3D reconstructions are doing today.

matsemann
0 replies
2d21h

Yeah, I was curious about that baby. Do they know how it looks, or do they just guess? What about the next video with the animals? The leaves on the bush: do they match a tree found there, or are they just generic leaves, perhaps from the wrong side of the world?

I guess it will be like people pointing out bird sounds in movies, that those birds don't exist in that country.

geor9e
4 replies
3d2h

This is great. I look forward to when cell phones run this at 60fps. It will hallucinate wrong, but pixel perfect moons and license plate numbers.

1970-01-01
2 replies
3d2h

Just get a plate with 'AAAAA4' and blame everything on 'AAAAAA'

xyst
0 replies
3d

So that’s why I don’t get toll bills.

MetaWhirledPeas
0 replies
2d22h

I look forward to VR 360 degree videos using something like this to overcome their current limitations, assuming the limit is on the capture side.

fladd
3 replies
2d22h

What exactly does this do? They have examples with a divider in the middle that you can move around and one side says "input" and the other "output". However, no matter where I move the slider, both sides look identical to me. What should I be focusing on exactly to see a difference?

7734128
2 replies
2d21h

It has clearly just loaded incorrectly for you (or you need glasses desperately). The effect is significant.

fladd
1 replies
2d21h

Tried again, same result. This is what I get: https://imgur.com/CvqjIhy

(And I already have glasses, thank you).

7734128
0 replies
9h28m

That's an error in your browser. It's not supposed to look like that.

tambourine_man
2 replies
2d23h

Videos autoplay in full screen as I scroll in mobile. Impressive tech, but could use better mobile presentation

can16358p
1 replies
2d19h

Yup, same here (iPhone Safari). They go fullscreen and can't dismiss them (they expand again) unless I try it very fast a few times.

tambourine_man
0 replies
2d17h

Terrible viewing experience

kfarr
2 replies
3d2h

This is amazing and all, but at what point do we reach the point where there is no more "real" data to infer from the low resolution? In other words, there is all sorts of information theory research on the amount of unique entropy in a given medium, and even with compression there is a limit. How does that limit relate to work like this? Is there a point at which we can say we know it's inventing things beyond some scaling constant x, because of information theory?

itishappy
0 replies
3d1h

This is amazing and all, but at what point do we reach the point where there is no more "real" data to infer from the low resolution?

The start point. Upscaling is by definition creating information where there wasn't any to begin with.

Nearest neighbor filtering is technically inventing information, it's just the dumbest possible approach. Bilinear filtering is slightly smarter. This approach tries to be smarter still by applying generative AI.
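
For reference, the "dumb" resampling baselines mentioned above look like this with Pillow (frame_128.png is a hypothetical low-res input; the generative model replaces this resampling step with learned detail):

    from PIL import Image

    low_res = Image.open("frame_128.png")              # hypothetical 128x128 frame
    target = (low_res.width * 8, low_res.height * 8)

    nearest = low_res.resize(target, Image.NEAREST)    # copies each pixel: blocky
    bilinear = low_res.resize(target, Image.BILINEAR)  # interpolates neighbours: smooth but soft

    nearest.save("nearest_8x.png")
    bilinear.save("bilinear_8x.png")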

incorrecthorse
0 replies
3d2h

I'm not sure information theory deals with this question.

Since this isn't lossless decompression, the point of having no "real" data is already reached. It _is_ inventing things, and the only relevant question is how plausible the invented things are; in other words, if the video also existed in higher resolution, how close would the inferred version actually look to it. Seems obvious that this metric increases as a function of the amount of information from the source, but I would guess the exact relationship is a very open question.

itishappy
2 replies
3d2h

It's impressive, but still looks kinda bad?

I think the video of the camera operator on the ladder shows the artifacts the best. The main camera equipment is no longer grounded in reality, with the fiddly bits disconnected from the whole and moving around. The smaller camera is barely recognizable. The plant in the background looks blurry and weird, the mountains have extra detail. Finally, the lens flare shifts!

Check out the spider too, the way the details on the leg shift is distinctly artificial.

I think the 4x/8x expansion (16x/64x the pixels!) is pushing the tech too far. I bet it would look great at <2x.

goggy_googy
0 replies
2d23h

I think the hand running through the wheat (?) is pretty good, object permanence is pretty reasonable especially considering the GAN architecture. GANs are good at grounded generation--this is why the original GigaGAN paper is still in use by a number of top image labs. Inferring object permanence and object dynamics is pretty impressive for this structure.

Plus, a rather small data set: REDS and Vimeo-90k aren't massive in comparison to what people speculate Sora was trained on.

Jackson__
0 replies
3d

I think the 4x/8x expansion (16x/64x the pixels!) is pushing the tech too far. I bet it would look great at <2x.

I believe this applies to every upscaling model released in the past 8 years, yet, undeterred by this, scientists keep pushing on, sometimes even claiming 16x upscaling. Though this might be the first one that comes pretty close to holding up at 4x, in my opinion, which is not something I've seen often.

therealmarv
1 replies
3d

When do I have that in my Nvidia Shield? I would pay $$$ to have that in real-time ;)

bick_nyers
0 replies
2d17h

Are you using your Shield as an HTPC? In that case you can use the upscaler built into a TV. I prefer my LG C2 upscale (particularly the frame interpolation) compared to most Topaz AI upscales.

sys32768
1 replies
3d2h

Finally, we get to know whether the Patterson bigfoot film is authentic.

dguest
0 replies
3d2h

I can't wait for the next explosion in "bigfoot" videos: wildlife on the moon, people hiding in shadows, plants, animals, and structures completely out of place.

The difference will be that this time the images will be crystal clear, just hallucinated by a neural network.

sizzle
1 replies
2d22h

The video comparison examples, while impressive, were basically unusable on mobile Safari because they launched in full screen view and broke the slider UI.

can16358p
0 replies
2d19h

Yeah, and in my case they immediately went fullscreen again the moment I dismissed them, hijacking the browser.

peppertree
1 replies
3d2h

Have we reached peak image sensor size. Would it still make sense to shoot in fullframe when you can just upscale.

dguest
0 replies
3d2h

If you want to use your image for anything that needs to be factual (i.e. surveillance, science, automation) the up-scaling adds nothing---it's just guessing on what is probably there.

If you just want the picture to be pretty, this is probably cheaper than a bigger sensor.

cjensen
1 replies
3d

The video of the owl is a great example of doing a terrible job without the average Joe noticing.

The real owl has fine light/dark concentric circles on its face. The app turned them into gray because it does not see any sign of the circles. The real owl has streaks of spots. The app turned them into solid streaks because it saw no sign of spots. There's more where this came from, but basically it only looks good to someone who has no idea what the owl should look like.

confused_boner
0 replies
2d22h

Is this considered a reincarnation of the 'rest of the owl' meme

aftbit
1 replies
3d2h

Can this take a crappy phone video of an object and convert that into a single high resolution image?

woctordho
0 replies
2d10h

Glad to see that Adobe is still investing in alias-free convolutions (as in StyleGAN3), and this time they know how to fill in the lost high-frequency features.

I always thought that alias-free convolutions could produce much more natural videos.

vim-guru
0 replies
2d11h

How long until we can have this run real-time in the browser? :D

thrdbndndn
0 replies
2d9h

is there code/model available to try out?

softfalcon
0 replies
3d

I'm curious as to how well this works when upscaling from 1080p to 4K or 4K to 8K.

Their 128x128 to 1024x1024 upscales are very impressive, but I find the real artifacts and weirdness are created when AI tries to upscale an already relatively high definition image.

I find it goes haywire, adding ghosting, swirling, banded shadowing, etc., as it whirlwinds into hallucinations from too much source data, since the model is often trained to turn really small/compressed video into an "almost HD" video.

smokel
0 replies
2d22h

It seems likely that our brains are doing something similar.

I remember being able to add a lot of detail to the monsters that I could barely make out amidst the clothes piled up on my bedroom floor.

skerit
0 replies
3d2h

No public model available yet? Would love to test and train it on some of my datasets.

sheepscreek
0 replies
2d18h

Very impressive demonstration but a terrible mobile experience. For the record, I am using iOS Safari.

sciencesama
0 replies
2d23h

Show me the code

rjmunro
0 replies
3d

I wonder if you could specialise a model by training it on a whole movie or TV series, so that instead of hallucinating from generic images, the model generates things it has seen closer-up in other parts of the movie.

You'd have to train it to go from a reduced resolution to the original resolution, then apply that to small parts of the screen at the original resolution to get an enhanced resolution, then stitch the parts together.
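
One way to read that idea as code (my interpretation, not the commenter's implementation; cv2 is used only for resizing and the patch sizes are arbitrary): downscale the movie's own frames to build aligned low-res/high-res training pairs, then apply the resulting movie-specific upscaler tile by tile.

    import numpy as np
    import cv2

    def make_training_pairs(frames, factor=2, patch=64, per_frame=8):
        """frames: HxWx3 uint8 arrays from the movie itself. Returns aligned
        (low-res patch, high-res patch) pairs for training a 2x upscaler."""
        pairs = []
        for hr in frames:
            lr = cv2.resize(hr, (hr.shape[1] // factor, hr.shape[0] // factor),
                            interpolation=cv2.INTER_AREA)   # simulated low-res input
            for _ in range(per_frame):                      # sample aligned patches
                y = np.random.randint(0, lr.shape[0] - patch)
                x = np.random.randint(0, lr.shape[1] - patch)
                pairs.append((lr[y:y + patch, x:x + patch],
                              hr[y * factor:(y + patch) * factor,
                                 x * factor:(x + patch) * factor]))
        return pairs  # feed these to whatever per-movie super-resolution model you train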

renewiltord
0 replies
3d1h

Wow, the results are amazing. Maintaining temporal consistency was just the beginning part. Very cool.

petermcneeley
0 replies
2d19h

I wonder how well this works with compressed video. The low res input video looks to be raw uncompressed.

nullsmack
0 replies
2d3h

I was hoping this would be an open-source video upscaler until I saw it was from Adobe.

kouru225
0 replies
3d1h

I need to learn how to use these new models

jiggawatts
0 replies
2d18h

Something I've been thinking about recently is a more scalable approach to video super-resolution.

The core problem is that any single AI will learn how to upscale "things in general", but won't be able to take advantage of inputs from the source video itself. E.g.: a close-up of a face in one scene can't be used elsewhere to upscale a distant shot of the same actor.

Transformers solve this problem, but with quadratic scaling, which won't work any time soon for a feature-length movie. Hence the 10 second clips in most such models.

Transformers provide "short term" memory, and the base model training provides "long term" memory. What's needed is medium-term memory. (This is also desirable for Chat AIs, or any long-context scenario.)

LoRA is more-or-less that: Given input-output training pairs it efficiently specialises the base model for a specific scenario. This would be great for upscaling a specific video, and would definitely work well in scenarios where ground-truth information is available. For example, computer games can be rendered at 8K resolution "offline" for training, and then can upscale 2K to 4K or 8K in real time. NVIDIA uses this for DLSS in their GPUs. Similarly, TV shows that improved in quality over time as the production company got better cameras could use this.

This LoRA fine-tuning technique obviously won't work for any single movie where there isn't high-resolution ground truth available. That's the whole point of upscaling: improving the quality where the high quality version doesn't exist!

My thought was that instead of training the LoRA fine-tuning layers directly, we could train a second order NN that outputs the LoRA weights! This is called a HyperNet, which is the term for neural networks that output neural networks. Simply put: many differentiable functions are twice (or more) differentiable, so we can minimise a minimisation function... training the trainer, in other words.

The core concept is to train a large base model on general 2K->4K videos, and then train a "specialisation" model that takes a 2K movie and outputs a LoRA for the base model. This acts as the "medium term" memory for the base model, tuning it for that specific video. The base model weights are the "long term" memory, and the activations are its "short term" memory.

I suspect (but don't have access to hardware to prove) that approaches like this will be the future for many similar AI tasks. E.g.: specialising a robot base model to a specific factory floor or warehouse. Or specialising a car driving AI to local roads. Etc...
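
A toy sketch of the "HyperNet emits LoRA weights" idea (my own illustration under the commenter's assumptions, using PyTorch; the dimensions and the per-video embedding are placeholders):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen base layer (long-term memory) whose output is adjusted by
        externally supplied low-rank factors A (r x in) and B (out x r)."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.base = nn.Linear(in_dim, out_dim)
            self.base.requires_grad_(False)

        def forward(self, x, A, B):
            return self.base(x) + x @ A.t() @ B.t()   # base output + low-rank correction

    class HyperNet(nn.Module):
        """Maps a per-video summary embedding to LoRA factors (medium-term memory)."""
        def __init__(self, video_dim, in_dim, out_dim, rank=4):
            super().__init__()
            self.rank, self.in_dim, self.out_dim = rank, in_dim, out_dim
            self.head = nn.Linear(video_dim, rank * (in_dim + out_dim))

        def forward(self, video_embedding):
            flat = self.head(video_embedding)
            A = flat[:self.rank * self.in_dim].view(self.rank, self.in_dim)
            B = flat[self.rank * self.in_dim:].view(self.out_dim, self.rank)
            return A, B

    layer = LoRALinear(256, 256)
    hyper = HyperNet(video_dim=512, in_dim=256, out_dim=256)

    video_embedding = torch.randn(512)   # placeholder summary of the whole movie
    A, B = hyper(video_embedding)        # video-specific LoRA weights
    features = torch.randn(10, 256)      # placeholder per-frame features
    out = layer(features, A, B)          # frozen base model, specialised per video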

imhereforwifi
0 replies
2d22h

This looks great; however, it will be interesting to see how it handles things like rolling shutter or video wipes/transitions. Also, in all of the sample videos the camera is either locked down and not moving, or moving just ever so slightly (the ants and the car clips). It looks like they took time to smooth out any excessive camera shake.

Integrating this with Adobe's object tracking software (in Premiere/After Effects) may help.

hellofellows
0 replies
3d

Hmm, is there something more specific for lecture videos? I'm tired of watching lectures in 480p...

forgingahead
0 replies
3d2h

No code?

esaym
0 replies
3d1h

Ok, how do I download it and use it though???

cynicalpeace
0 replies
2d23h

We need to input UFO vids into this ASAP to get a better guess as to what some of those could be.

adzm
0 replies
3d1h

Curious how this compares to Topaz which is the current industry leader in the field.

IncreasePosts
0 replies
3d2h

I find it interesting how it changed the bokeh from an octagon to a circle.

1shooner
0 replies
2d22h

This seems technically very impressive, but it does occur to my more pragmatic side that I probably haven't seen videos as blurry as these inputs for ~10 years. I'm sure I'm unaware of important use cases, but I didn't realize video resolution was a thing we needed to solve for these days (at least inference for perceptual quality).