"return to table of content"

Stable Video Diffusion

btbuildem
99 replies
1d3h

In the video towards the bottom of the page, there are two birds (blue jays), but in the background there are two identical buildings (which look a lot like the CN Tower). CN Tower is the main landmark of Toronto, whose baseball team happens to be the Blue Jays. It's located near the main sportsball stadium downtown.

I vaguely understand how text-to-image works, and so it makes sense that the vector space for "blue jays" would be near "toronto" or "cn tower". The improvements in scale and speed (image -> now video) are impressive, but given how incredibly capable the image generation models are, they simultaneously feel crippled and limited by their lack of editing / iteration ability.

Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

TacticalCoder
64 replies
1d2h

Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

I feel like we're close too, but for another reason.

For although I love SD and these video examples are great... It's a flawed method: they never get lighting correctly and there are many incoherent things just about everywhere. Any 3D artist or photographer can immediately spot that.

However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

And the scene shall be sent into Blender and you'll click on a button and have an actual rendering made by Blender, with correct lighting.

Wanna move that bicycle? Move it in the 3D scene exactly where you want.

That is coming.

And for audio it's the same: why generate an audio file when soon models shall be able to generate the various tracks, with all the instruments and whatnot, allowing you to create the audio file?

That is coming too.

epr
26 replies
1d2h

you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

I'm always confused why I don't hear more about projects going in this direction. Controlnets are great, but there's still quite a lot of hallucination and other tiny mistakes that a skilled human would never make.

boppo1
13 replies
1d1h

Blender files are dramatically more complex than any image format, which are basically all just 2D arrays of 3-value vectors. The blender filetype uses a weird DNA/RNA struct system that would probably require its own training run.

More on the Blender file format: https://fossies.org/linux/blender/doc/blender_file_format/my...

mikepurvis
10 replies
1d1h

But surely you wouldn't try to emit that format directly, but rather some higher level scene description? Or even just a set of instructions for how to manipulate the UI to create the imagined scene?

numpad0
3 replies
18h0m

It sure feels weird to me as well that GenAI is always supposed to be end-to-end, with everything done inside an NN black box. No one seems to be doing image output as SVG or .ai.

HammadB
1 replies
13h58m

There is a fundamental disconnect between industry and academia here.

maccard
0 replies
11h25m

Over the last 10 years of industry work, I'd say about 20% of my time has been format shifting, or parsing half baked undocumented formats that change when I'm not paying attention.

That pretty much matches my experience working with NNs and LLMs.

metanonsense
0 replies
13h17m

Imo the thinking is that whenever humans have tried to pre-process or feature-engineer a solution or tried to find clever priors in the past, massive self-supervised-learning enabled, coarsely architected, data-crunching NNs got better results in the end. So, many researchers / industry data scientists may just be disinclined to put effort into something that is doomed to be irrelevant in a few years. (And, of course, with every abstraction you will lose some information that may bear more importance than initially thought)

mikebelanger
3 replies
19h30m

Yeah I'd imagine that's the best way. Lots of LLMs can generate workable Python code too, so code that jives with Blender's Python API doesn't seem like too much of a leap.

The only trick is that there has to be enough Blender Python code to train the LLM on.
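
For a concrete sense of what that would look like, here is a minimal sketch using Blender's bpy API, run from Blender's bundled Python (object and light names here are just illustrative; the specific edits are stand-ins for what an LLM-emitted script might do):

```python
import bpy  # only available inside Blender's bundled Python

# Add a cylinder as a stand-in object and nudge it to the left,
# roughly what a "move the bicycle to the left" instruction would map to.
bpy.ops.mesh.primitive_cylinder_add(location=(0.0, 0.0, 1.0))
obj = bpy.context.active_object
obj.location.x -= 2.0

# Set up a simple sun light so the render has a defined light source.
light_data = bpy.data.lights.new(name="key_light", type='SUN')
light_obj = bpy.data.objects.new(name="key_light", object_data=light_data)
bpy.context.collection.objects.link(light_obj)
light_obj.location = (4.0, -4.0, 6.0)
```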

arcticbull
2 replies
18h45m

Maybe something like OpenSCAD is a good middle ground. Procedural code-like format for specifying 3D objects that can then be converted and imported in Blender.

lightedman
1 replies
16h32m

I tried all the AI stuff that I could on OpenSCAD.

While it generates a lot of code that initially makes sense, when you use the code, you get a jumbled block.

regularfry
0 replies
13h56m

This. I think the problem is that LLMs really struggle with 3D scene understanding, so what you would need to do is generate code that generates code.

But I also suspect there just isn't that much OpenSCAD code in the training data, and its semantics are different enough from Python or any of the other well-represented languages that the model struggles.

BirdieNZ
1 replies
1d1h

I've seen this but producing Python scripts that you run in Blender, e.g. https://www.youtube.com/watch?v=x60zHw_z4NM (but I saw something marginally more impressive, not sure where though!)

bsenftner
0 replies
10h1m

My god that is an irritating video style, "AI woweee!"

guyomes
0 replies
1d

Voxel files could be a simpler step for 3D images.

Keyframe
0 replies
1d1h

Scene layouts, models and their attributes are the result of user input (ok, and sometimes program output). One avenue to take there would be to train on input expecting an output, like teaching a model to draw instead of generating images... which in a sense we already did by broadly painting out silhouettes and then rendering details.

lairv
3 replies
23h5m

I think the bottleneck is data

For single 3D object the biggest dataset is ObjaverseXL with 10M samples

For full 3D scenes you could at best get ~1000 scenes with datasets like ScanNet I guess

Text2Image models are trained on datasets with 5 billion samples

bsenftner
2 replies
9h55m

Oh, I don't know about that. Working in feature film animation, studios have gargantuan model libraries from current and past projects, with a good number (over half) never used by a production but created as part of some production's world building. Plus, generative modeling has been very popular for quite a few years. I don't think getting more 3D models than they could use is a real issue for anyone serious.

senseiV
1 replies
5h45m

Where can you find those? I'm in the same situation as him, I've never heard of a 3d dataset better than objaverse XL.

Got a public dataset?

bsenftner
0 replies
4h59m

These are not public datasets, but with some social engineering I bet one could get access.

I've not worked in VFX for a while, but when I did the modeling departments at multiple studios had giant libraries of completed geometries for every project they ever did, plus even larger libraries of all the pieces and parts they use as generic lego geometry whenever they need something new.

Every 3D modeler I know has their own personal libraries of things they'd made as well as their own "lego sets" of pieces and parts and generative geometry tools they use when making new things.

Now this is just a guess, but do you know anyone going through one of those video game schools? I wager the schools have big model libraries for the students as well. Hell, I bet Ringling and Sheridan (the two Harvards of Animation) have colossally sized model libraries for use by their students. Contact them.

jowday
2 replies
1d

There's a lot of issues with it, but perhaps the biggest is that there aren't just troves of easily scrapable and digestible 3D models lying around on the internet to train on top of like we have with text, images, and video.

Almost all of the generative 3D models you see are actually generative image models that essentially (very crude simplification) perform something like photogrammetry to generate a 3D model - 'does this 3D object, rendered from 25 different views, match the text prompt as evaluated by this model trained on text-image pairs'?

This is a shitty way to generate 3D models, and it's why they almost all look kind of malformed.
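
For the curious, the loop being described looks roughly like the sketch below. All helper names are hypothetical, and real systems such as DreamFusion use score distillation from a diffusion model rather than a plain CLIP similarity, but the overall shape is similar: differentiable scene parameters get rendered from several viewpoints and nudged toward agreement with the text prompt.

```python
import torch

def optimize_scene(scene_params, text_embedding, render, clip_image_encoder,
                   sample_camera, steps=1000, views_per_step=4):
    """Hypothetical render-and-score loop: scene_params is a differentiable
    3D representation (e.g. NeRF weights or mesh parameters)."""
    opt = torch.optim.Adam([scene_params], lr=1e-2)
    for _ in range(steps):
        # Render the current scene from a few random viewpoints.
        images = torch.stack([render(scene_params, sample_camera())
                              for _ in range(views_per_step)])
        # Score each view against the text prompt in a shared embedding space.
        image_embeddings = clip_image_encoder(images)
        loss = -torch.nn.functional.cosine_similarity(
            image_embeddings, text_embedding.expand_as(image_embeddings)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return scene_params
```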

sterlind
1 replies
1d

If reinforcement learning were farther along, you could have it learn to reproduce scenes as 3D models. Each episode's task is to mimic an image, each step is a command mutating the scene (adding a polygon, or rotating the camera, etc.), and the reward signal is image similarity. You can even start by training it with synthetic data: generate small random scenes and make them increasingly sophisticated, then later switch over to trying to mimic images.

You wouldn't need any models to learn from. But my intuition is that RL is still quite weak, and that the model would flounder after learning to mimic background color and placing a few spheres.
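
To make the setup concrete, the episode loop could look something like this rough sketch, where every helper (agent, renderer, apply_command, image_similarity) is hypothetical:

```python
def run_episode(agent, renderer, target_image, max_steps=50):
    """Hypothetical RL episode: the agent emits scene-editing commands and is
    rewarded for making the rendered scene look more like the target image."""
    scene = []          # primitives placed so far
    total_reward = 0.0
    prev_score = image_similarity(renderer.render(scene), target_image)
    for _ in range(max_steps):
        # e.g. "add sphere at (x, y, z)" or "rotate camera 10 degrees"
        action = agent.act(renderer.render(scene), target_image)
        scene = apply_command(scene, action)          # mutate the scene
        score = image_similarity(renderer.render(scene), target_image)
        reward = score - prev_score                   # reward the improvement
        agent.observe(reward)
        prev_score = score
        total_reward += reward
    return total_reward
```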

sanitycheck
0 replies
9h38m

From my very clueless perspective, it seems very possible to train an AI to use Blender to create images in a mostly unsupervised way.

So we could have something to convert AI-generated image output into 3D scenes without having to explicitly train the "creative" AI for that.

Probably much more viable, because the quantity of 3D models out in the wild is far far lower than that of bitmap images.

insanitybit
0 replies
22h23m

I assume because it's still extremely early.

eigenvalue
0 replies
20h0m

I think this recent Gaussian Splatting technique could end up working really well for generative models, at least once there is a big corpus of high quality scenes to train on. Seems almost ideal for the task because it gets photorealistic results from any angle, but in a sparse, data efficient way, and it doesn’t require a separate rendering pipeline.

dragonwriter
0 replies
23h31m

I'm always confused why I don't hear more about projects going in this direction.

Probably because they aren't as advanced and the demos aren't as impressive to nontechnical audiences who don't understand the implications: there’s lots of work on text-to-3d-model generation, and even plugins for some stable diffusion UIs (e.g., MotionDiff for ComfyUI.)

bozhark
0 replies
1d1h

One was on the front page the other day, I’ll search for a link

atentaten
11 replies
1d2h

What's your reasoning for feeling that we're close?

cptaj
10 replies
1d2h

We do it for text, audio and bitmapped images. A 3D scene file format is no different: you could train a model to output a Blender file format instead of a bitmap.

It can learn anything you have data for.

Heck, we do it with geospatial data already, generating segmentation vectors. Why not 3D?

jncfhnb
4 replies
23h40m

Text, audio, and bitmapped images are data. Numbers and tokens.

A 3D scene is vastly more complex, and the way you consume it is tangential to the rendering of it we use to interpret. It is a collection of arbitrary data structures.

We’ll need a new approach for this kind of problem

dragonwriter
3 replies
23h35m

Text, audio, and bitmapped images are data. Numbers and tokens.

A 3D scene is vastly more complex

3D scenes, in fact, are also data, numbers and tokens. (Well, numbers, but so are tokens.)

jncfhnb
2 replies
21h53m

As I stated and you selectively omitted, 3D scenes are collections of many arbitrary data structures.

Not at all the same as fixed sized arrays representing images.

dragonwriter
1 replies
21h50m

Text gen, one of the things you contrast 3d to, similarly isn't fixed size (capped in most models, but not fixed.)

In fact, the data structures of a 3D scene can be serialized as text, and a properly trained text gen system could generate such a representation directly, though that's probably not the best route to decent text-to-3d.
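
As a trivial illustration (the format here is entirely made up), a scene serialized as text could be as simple as a structure a converter script then instantiates in Blender:

```python
# A hypothetical text serialization of a small scene that a text model could
# emit and a separate converter script could then build in Blender.
scene = {
    "objects": [
        {"type": "mesh", "name": "bicycle", "source": "bicycle.obj",
         "location": [-2.0, 0.0, 0.0], "rotation_euler": [0.0, 0.0, 1.57]},
        {"type": "light", "name": "key", "kind": "SUN",
         "location": [4.0, -4.0, 6.0], "energy": 3.0},
        {"type": "camera", "name": "cam", "location": [0.0, -8.0, 1.6]},
    ]
}
```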

jncfhnb
0 replies
20h39m

Text is a sequence of standard-sized embedding vectors that get passed to an LLM one at a time. All tokens have the same shape, each token is processed one at a time, and all tokens have a predefined order. It is very different and vastly simpler.

Serializing 3D models as text is not going to work for anything beyond trivial cases.

boppo1
3 replies
1d1h

3D scene file format is no different

Not in theory, but the level of complexity is way higher and the amount of data available is much smaller.

Compare bitmaps to this: https://fossies.org/linux/blender/doc/blender_file_format/my...

kaibee
2 replies
1d1h

Also the level of fault tolerance... if your pixels are a bit blurry, chances are no one notices at a high enough resolution. If your json is a bit blurry you have problems.

astrange
1 replies
20h47m

You can do "constrained decoding" on a code model which keeps it grammatically correct.

But we haven't gotten diffusion working well for text/code, so generating long files is a problem.
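
A toy sketch of what constrained decoding means in practice follows. The is_valid_prefix grammar check is hypothetical, and the per-token filter is deliberately naive (real implementations precompile the grammar into something far faster); it assumes a Hugging Face style causal LM and tokenizer.

```python
import torch

def constrained_greedy_decode(model, tokenizer, prompt, is_valid_prefix,
                              max_new_tokens=128):
    """Mask out any next token whose addition would break the grammar,
    then pick greedily among the survivors."""
    ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[0, -1].clone()  # next-token scores
        prefix = ids[0].tolist()
        for tok in range(logits.shape[0]):             # naive O(vocab) filter
            if not is_valid_prefix(tokenizer.decode(prefix + [tok])):
                logits[tok] = float("-inf")
        next_id = torch.argmax(logits).view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0])
```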

DougBTX
0 replies
9h58m

Recent results for code diffusion here: https://www.microsoft.com/en-us/research/publication/codefus...

I'm not experienced enough to validate their claims, but I love the choice of languages to evaluate on:

Python, Bash and Excel conditional formatting rules.

p1esk
4 replies
1d2h

Are you working on all that?

cptaj
3 replies
1d2h

Probably not. But there does seem to be a clear path to it.

The main issue is going to be having the right dataset. You basically need to record user actions in something like Blender (i.e., moving a model of a bike to the left of a scene), match it to a text description of the action (i.e., "move bike to the left"), and match those to before/after snapshots of the resulting file format.

You need a whole metric fuckton of these.

After that, you train your model to produce those 3d scene files instead of image bitmaps.

You can do this for a lot of other tasks. These general purpose models can learn anything that you can usefully represent in data.
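
One hypothetical record of that kind of dataset, just to make the shape concrete (all field names and file names are made up):

```python
record = {
    "instruction": "move the bike to the left",
    "scene_before": "scene_0001_before.blend",   # snapshot before the edit
    "scene_after": "scene_0001_after.blend",     # snapshot after the edit
    "actions": [                                 # the recorded editor actions
        {"op": "translate", "object": "bike", "delta": [-2.0, 0.0, 0.0]},
    ],
}
```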

I can imagine AGI being, at least in part, a large set of these purpose trained models. Heck, maybe our brains work this way. When we learn to throw a ball, we train a model in a subset of our brain to do just this and then this model is called on by our general consciousness when needed.

Sorry, I'm just rambling here but its very exciting stuff.

sterlind
1 replies
1d

The hard part of AGI is the self-training and few examples. Your parents didn't attach strings to your body and puppeteer you through a few hundred thousand games of baseball. And the humans that invented baseball had zero training data to go on.

p1esk
0 replies
21h33m

Your body is a result of a billion year old evolutionary optimization process. GPT-4 was trained from scratch in a few months.

filipezf
0 replies
22h48m

I have for some time been planning to do a 'Wikipedia for AI' (even bought a domain), where people could contribute all sorts of these skills (not only 3D video, but also manual skills, or anything). Given the current climate of 'AI will save/doom us', and that users would in some sense be training their own replacements, I don't know how much love such a site would have, though.

coldtea
4 replies
1d

For although I love SD and these video examples are great... It's a flawed method: they never get lighting correctly and there are many incoherent things just about everywhere. Any 3D artist or photographer can immediately spot that.

The question is whether the 99% of the audience would even care...

COAGULOPATH
3 replies
17h41m

Of course they would. The internet spent a solid month laughing at the Sonic the Hedgehog movie because Sonic had weird-looking teeth.

coldtea
1 replies
10h41m

Since that movie did well and spawned 2 sequels, the real conclusion is that the viewers didn't really care.

As for "the internet", there will always some small part of it which will obsess and/or laught over anything, doesn't mean they represent anything significant - not even when they're vocal.

PawgerZ
0 replies
8h17m

Viewers did care: the teeth got changed before the movie was released. And, I don't know if you missed it, but it wasn't just one niche of the internet commenting on his teeth. The "outrage" went mainstream; even dentists were making hit-pieces on Sonic's teeth. I'm not gonna lie, it was amazing marketing for the movie, intentional or not.

ekianjo
0 replies
12h27m

No they laughed at it because it looked awful in every single way

whywhywhywhy
3 replies
8h24m

Nah, I disagree; this feels like a glorification of the process, not the end result. Having the 3D model in the scene with all the lighting just makes the end result feel more solid to you, because you feel you can see the work that's going into it.

In the end diffusion technology can make a more realistic image faster than a rendering engine can.

I feel pretty strongly that this pipeline will be the foundation for most of the next decade of graphics, and making things by hand in 3D will become extremely niche, because, let's face it, anyone who has worked in 3D knows it's tedious, it's time consuming, it takes large teams, and it's not even well paid.

The future is just tools that give us better controls and every frame will be coming from latent space not simulated photons.

I say this as someone who had done 3D professionally in the past.

bbor
1 replies
5h58m

I find that very unlikely. LLMs seem capable of simulating human intuition, but not great at simulating real complex physics. Human intuition of how a scene “should” look isn't always the effect you want to create, and I'm guessing it's rarely accurate.

dragonwriter
0 replies
5h45m

LLMs seem capable of simulating human intuition, but not great at simulating real complex physics.

Diffusion models aren't LLMs (they may use something similar as their text encoder layer), and they simulate their training corpus, which usually isn't selected solely for physical fidelity, because physical fidelity isn't actually the single criterion for visual imagery outside of what is created by diffusion models.

pegasus
0 replies
6h22m

Nah, I agree with GP. Who didn't suggest making 3D scenes by hand, but the opposite: create those 3D scenes using the generative method, use ray-tracing or the like to render the image. Maybe have another pass through a model to apply any touch-ups to make it more gritty and less artificial. This way things can stay consistent and sane, avoiding all those flaws which are so easy to spot today.

jwoodbridge
2 replies
17h57m

we're working on this if you want to give it a try - dream3d.com

hackerlight
1 replies
15h52m

You should put a demo on the landing page

jwoodbridge
0 replies
9h36m

just redid the ux and making a new one, but here's a quick example: https://www.loom.com/share/fa84ba92d7144179ac17ece9bf7fbd99

wruza
0 replies
13h2m

Not that I’m against the described 3d way, but personally I don’t care about light and shadows until it’s so bad that I do. This obsession with realism is irrational in video games. In real life people don’t understand why light works like this or like that. We just accept it. And if you ask someone to paint how it should work, the result is rarely physical but acceptable. It literally doesn’t matter until it’s very bad.

solarkraft
0 replies
21h50m

Where is the training data coming from?

sheepscreek
0 replies
23h52m

Excellent point.

Perhaps a more computationally expensive but better looking method will be to pull all objects in the scene from a 3D model library, then programmatically set the scene and render it.

internet101010
0 replies
1d1h

I am guessing it will be similar to inpainting in normal Stable Diffusion, which is easy when using the workflow feature in the InvokeAI UI.

btbuildem
0 replies
8h34m

That indeed sounds like a very plausible solution -- working with AI on the level of scene definitions, model geometries etc.

However, 3D is just one approach to rendering visuals. There are so many other styles and methods by which people create images, and if I understand correctly, we can do image-to-text to analyze image content, as well as text-to-image to generate it - regardless of the original method (3D render or paintbrush or camera lens). There are some "fuzzy primitives" in the layers there that translate to the visual elements.

I'm hoping we see "editors" that let us manipulate / edit / iterate over generated images in terms of those.

bob1029
0 replies
1d2h

However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

I agree with this philosophy - Teach the AI to work with the same tools the human does. We already have a lot of human experts to refer to. Training material is everywhere.

There isn't a "text-to-video" expert we can query to help us refine the capabilities around SD. It's a one-shot, Jupiter-scale model with incomprehensible inertia. Contrast this with an expert-tuned model (i.e. natural language instructions) that can be nuanced precisely and to the the point of imperceptibility with a single sentence.

The other cool thing about the "use existing tools" path is that if the AI fails part way through, it's actually possible for a human operator to step in and attempt recovery.

a_bouncing_bean
0 replies
1d1h

Thanks! This is exactly what I have been thinking, only you've expressed it much more eloquently than I would be able to.

Kuinox
0 replies
23h42m

This isn't coming, it's already here: https://github.com/gsgen3d/gsgen Yes, it's just 3D models for now, but it can do whole-scene generation; it's just not great at it yet. The tech is there, it just needs to improve.

psunavy03
6 replies
1d2h

sportsball

This is not the flex you think it is. You don't have to like sports, but snarking on people who do doesn't make you intellectual, it just makes you come across as a douchebag, no different than a sports fan making fun of "D&D nerds" or something.

Zetaphor
2 replies
1d2h

This has become a colloquial term for describing all sports, not the insult you're perceiving it to be.

Rather than projecting your own hangups and calling people names, try instead assuming that they're not trying to offend you personally and are just using common vernacular.

achileas
1 replies
20h51m

If only there was an existing way to refer to sports generally! And OP was referring to a specific sport (baseball), not sports generally.

btbuildem
0 replies
8h25m

The Rogers Centre hosts baseball, football, and basketball games - so in this case "sportsball" was just a shorthand for all these ball sports.

jojobas
0 replies
15h12m

Would you get incensed by "petrolhead", "greenfingers" or "trekkie"? Is that what you choose to be emotional about?

chaps
0 replies
1d2h

Ah, Mr. Kettle, I see you've met my friend, Mr. Pot!

callalex
0 replies
4h57m

You’re really not helping the “sports fans are combative thugs” stereotype by going off on an insult tirade over an innocent word.

appplication
4 replies
1d2h

I don’t spend a lot of time keeping up with the space, but I could have sworn I’ve seen a demo that allowed you to iterate in the way you’re suggesting. Maybe someone else can link it.

ssalka
1 replies
1d2h

My guess is you're thinking of InstructPix2Pix[1], with prompts like "make the sky green" or "replace the fruits with cake"

[1] https://github.com/timothybrooks/instruct-pix2pix

appplication
0 replies
23h55m

This is exactly it!

accrual
0 replies
1d2h

It's not exactly like GP described (e.g. move bike to the left) but there is a more advanced SD technique called inpainting [0] that allows you to manually recompose parts of the image, e.g. to fix bad eyes and hands.

[0] https://stable-diffusion-art.com/inpainting_basics/
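
For reference, a minimal inpainting run with the diffusers library looks roughly like this (the model id and file names are just examples):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB")
mask = Image.open("mask.png").convert("RGB")   # white pixels = region to repaint

result = pipe(prompt="a red bicycle leaning against the wall",
              image=image, mask_image=mask).images[0]
result.save("photo_inpainted.png")
```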

kshacker
2 replies
1d2h

Assuming we can post links, you mean this video: https://youtu.be/G7mihAy691g?si=o2KCmR2Uh_97UQ0N

Also, maybe you can't edit post facto, but when you give prompts, would you not be able to say: two blue jays but no CN tower?

FrozenTuna
1 replies
1d2h

Yes, it's called a negative prompt. Idk if txt2video has it, but both LLMs and Stable Diffusion have it, so I'd assume it's good to go.

nottheengineer
0 replies
1d2h

Haven't implemented negative prompts yet, but from what I can tell it's as simple as subtracting from the prompt in embedding space.
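
In the common Stable Diffusion implementations it's slightly different from a plain subtraction: the negative prompt's embedding stands in for the empty prompt in classifier-free guidance, so each denoising step is pushed away from it. A minimal usage sketch with diffusers (model id and prompts are just examples):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Internally the negative prompt replaces the unconditional embedding, and the
# denoiser output is steered along (conditional - negative) at each step.
image = pipe(prompt="two blue jays perched on a branch",
             negative_prompt="CN Tower, skyscrapers, buildings",
             guidance_scale=7.5).images[0]
image.save("blue_jays.png")
```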

dsmmcken
2 replies
19h51m

Adobe is doing some great work here in my opinion in terms of building AI tools that make sense for artist workflows. This "sneak peek" demo from the recent Adobe Max conference is pretty much exactly what you described, actually better because you can just click on an object in the image and drag it.

See video: https://www.adobe.com/max/2023/sessions/project-stardust-gs6...

thatoneguy
0 replies
5h21m

Makes me wonder if they train their data on everything anyone has ever uploaded to Creative Cloud.

btbuildem
0 replies
8h17m

Right, that's embedded directly into the existing workflow. Looks like a very powerful feature indeed.

xianshou
1 replies
1d2h

Emu edit should be exactly what you're looking for: https://ai.meta.com/blog/emu-text-to-video-generation-image-...

smcleod
0 replies
1d1h

It doesn’t look like the code for that is available anywhere though?

achileas
1 replies
20h49m

Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

Nearly all of the available models have this, even the highly commercialized ones like Adobe Firefly and Canva; it's called inpainting in most tools.

btbuildem
0 replies
8h23m

I think that's more "inpainting" where the existing software solution uses AI to accelerate certain image editing tasks. I was looking for whole-image manipulation at the "conceptual" level.

01100011
1 replies
21h46m

I recently tried to generate clip art for a presentation using GPT-4/DALL-E 3. I found it could handle some updates but the output generally varied wildly as I tried to refine the image. For instance, I'd have a cartoon character checking its watch and also wearing a pocket watch. Trying to remove the pocket watch resulted in an entirely new cartoon with little stylistic continuity to the first.

Also, I originally tried to get the 3 characters in the image to be generated simultaneously, but eventually gave up as DALL-E had a hard time understanding how I wanted them positioned relative to each other. I just generated 3 separate characters and positioned them in the same image using Gimp.

btbuildem
0 replies
8h31m

Yes that's exactly what I'm referring to! It feels as if there is no context continuity between the attempts.

zeckalpha
0 replies
16h26m

I see that as a reference to the AI generated Toronto Blue Jays advertisement gone wrong that went viral earlier this year. https://www.blogto.com/sports_play/2023/06/ai-generated-toro...

treesciencebot
0 replies
1d2h

Have you seen fal.ai/dynamic where you can perform image to image synthesis (basically editing an existing image with the help of diffusion process) using LCMs to provide a real time UI?

stevage
0 replies
21h7m

I wondered similarly whether the astronaut's weird gait was because it was kind of "moonwalking" on the moon.

omneity
0 replies
7h31m

Nice eye!

As for your last question yes that exists. There are two models from Meta that do exactly this, instruction based iteration on photos, Emu Edit[0], and videos, Emu Video[1].

There's also LLaVa-interactive[2] for photos where you can even chat with the model about the current image.

[0]: https://emu-edit.metademolab.com/

[1]: https://emu-video.metademolab.com/

[2]: https://llava-vl.github.io/llava-interactive/

filterfiber
0 replies
1d2h

Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

Emu can do that.

The bluejay/toronto thing may be addressable later (I suspect via more detailed annotations a la dalle3) - these current video models are highly focused on figuring out temporal coherence

amoshebb
0 replies
1d2h

I wonder what other odd connections are made due to city-name almost certainly being the most common word next to sportsball-name.

Do the parameters think that Jazz musicians are mormon? Padres often surf? Wizards like the Lincoln Memorial?

ProfessorZoom
0 replies
1d2h

That sounds like v0 by Vercel: you can iterate just like you asked. Combining that type of iteration with video would be really awesome.

JoshTriplett
0 replies
1d2h

I also wonder if the model takes capitalization into account. Capitalized "Blue Jays" seems more likely to reference the sports team; the birds would be lowercase.

FrozenTuna
0 replies
1d2h

Not exactly what you're asking for, but AnimateDiff has introduced creating gifs to SD. Still takes quite a bit of tweaking IME.

COAGULOPATH
0 replies
17h46m

they simultaneously feel crippled and limited by their lack of editing / iteration ability.

Yeah. They're not "videos" so much as images that move around a bit.

This doesn't really look any better than those Midjourney + RunwayML videos we had half a year ago.

Has anyone come across a solution where the model can iterate (e.g., with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

Google has a model called Phenaki that supposedly allows for that kind of stuff. But the public can't use it so it's hard to say how good it actually is.

valine
38 replies
1d3h

The rate of progress in ML this past year has been breathtaking.

I can’t wait to see what people do with this once controlnet is properly adapted to video. Generating videos from scratch is cool, but the real utility of this will be the temporal consistency. Getting stable video out of stable diffusion typically involves lots of manual post processing to remove flicker.

alberth
21 replies
1d1h

What was the big “unlock” that allowed so much progress this past year?

I ask as a noob in this area.

mlboss
7 replies
1d1h

The Stable Diffusion open source release and the LLaMA release.

alberth
6 replies
1d1h

But what technically allowed for so much progress?

There’s been open source AI/ML for 20+ years.

Nothing comes close to the massive milestones over the past year.

jasonjmcghee
1 replies
1d1h

People figuring out how to train and scale newer architectures (like transfomers) effectively, to be wildly larger than ever before.

Take AlexNet - the major "oh shit" moment in image classification.

It had an absolutely mind-blowing number of parameters at a whopping 62 million.

Holy shit, what a large network, right?

Absolutely unprecedented.

Now, for language models, anything under 1B parameters is a toy that barely works.

Stable diffusion has around 1B or so - or the early models did, I'm sure they're larger now.

A whole lot of smart people had to do a bunch of cool stuff to be able to keep networks working at all at that size.

Many, many times over the years, people have tried to make larger networks, which fail to converge (read: learn to do something useful) in all sorts of crazy ways.

At this size, it's also expensive to train these things from scratch, and takes a shit-ton of data, so research/discovery of new things is slow and difficult.

But, we kind of climbed over a cliff, and now things are absolutely taking off in all the fields around this kind of stuff.

Take a look at XTTSv2 for example, a leading open source text-to-speech model. It uses multiple models in its architecture, but one of them is GPT.

There are a few key models that are still being used in a bunch of different modalities like CLIP, U-Net, GPT, etc. or similar variants. When they were released / made available, people jumped on them and started experimenting.

dragonwriter
0 replies
1d

Stable diffusion has around 1B or so - or the early models did, I'm sure they're larger now.

SDXL is 6.6 billion.

mschuster91
0 replies
21h23m

But what technically allowed for so much progress?

The availability of GPU compute time. Up until the Russian invasion into Ukraine, interest rates were low AF so everyone and their dog thought it would be a cool idea to mine one or another sort of shitcoin. Once rising interest rates killed that business model for good, miners dumped their GPUs on the open market, and an awful lot of cloud computing capacity suddenly went free.

kmeisthax
0 replies
1d1h

Attention, transformers, diffusion. Prior image synthesis techniques - i.e. GANs - had problems that made it difficult to scale them up, whereas the current techniques seem to have no limit other than the amount of RAM in your GPU.

fragmede
0 replies
16h59m

The "Attention Is All You Need" paper from Google, which may end up being a larger contribution to society than Google search, is foundational.

Emad Mostaque and his investment in stable diffusion, and his decision to release it to the world.

I'm sure there are others, but those are the two that stick out to me.

Chabsff
0 replies
1d1h

Public availability of large transformer-based foundation models trained at great expense, which is what OP is referring to, is definitely unprecedented.

4death4
6 replies
1d1h

I think these are the main drivers behind the progress:

- Unsupervised learning techniques, e.g. transformers and diffusion models. You need unsupervised techniques in order to utilize enough data. There have been other unsupervised techniques in the past, e.g. GANs, but they don't work as well.

- Massive amounts of training data.

- The belief that training these models will produce something valuable. It costs between hundreds of thousands to millions of dollars to train these models. The people doing the training need to believe they're going to get something interesting out at the end. More and more people and teams are starting to see training a large model as something worth pursuing.

- Better GPUs, which enables training larger models.

- Honestly the fall of crypto probably also contributed, because miners were eating a lot of GPU time.

mkaic
4 replies
1d1h

I don't think transformers or diffusion models are inherently "unsupervised", especially not the way they're used in Stable Diffusion and related models (which are very much trained in a supervised fashion). I agree with the rest of your points though.

ebalit
2 replies
1d1h

Generative methods have usually been considered unsupervised.

You're right that conditional generation start to blur the lines though.

n2d4
1 replies
22h43m

"Generative AI" is a misnomer; it's not the same kind of "generative" as the G in GAN.

While you're right about GANs, diffusion models as transformers as transformers are most commonly trained with supervised learning.

ebalit
0 replies
22h34m

I disagree. Diffusion models are trained to generate the probability distribution of their training dataset, like other generative models (GAN, VAE, etc). The fact that the architecture is a Transformer (or a CNN with attention like in Stable Diffusion) is orthogonal to the generative vs discriminative divide.

Unsupervised is a confusing term as there is always an underlying loss being optimized and working as a supervision signal, even for good old kmeans. But generative models are generally considered to be part of unsupervised methods.
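
For what it's worth, the training step for a DDPM-style diffusion model really is just noise prediction over the training distribution, something like this minimal sketch (the model's (x_t, t) signature and the precomputed noise schedule are assumptions of the sketch):

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, alphas_cumprod):
    """One DDPM-style step: corrupt clean samples x0 with noise at a random
    timestep and train the model to predict that noise."""
    batch = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (batch,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(batch, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward (noising) process
    noise_pred = model(x_t, t)                               # epsilon prediction
    return F.mse_loss(noise_pred, noise)
```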

valec
0 replies
17h17m

self-supervised is a better term

JCharante
0 replies
8h53m

The belief that training these models will produce something valuable

Exactly. The growth in the next decade is going to be unimaginable because now governments and MNCs believe that there will realistically be progress made in this field.

Cyphase
1 replies
1d1h

One factor is that Stable Diffusion and ChatGPT were released within about 3 months of each other – August 22, 2022 and November 30, 2022, respectively. That brought a lot of attention and excitement to the field. More excitement, more people, more work being done, more progress.

Of course those two releases didn't fall out of the sky.

JCharante
0 replies
8h52m

Dalle 2 also went viral around the same time

throwaway290
0 replies
22h12m

MS subsidizing it with 10 billion USD and (un)healthy contempt towards copyright.

password54321
0 replies
9h39m

There has been massive progress in ML every year since 2013, partly due to better GPUs and lots of training data. Many are only taking notice now that it is in products but it wasn't that long ago there was skepticism on HN even when software like Codex existed in 2021.

moritonal
0 replies
9h55m

Where do you want to start? The Internet collecting and structuring the world's knowledge into a few key repositories? The focus on GPUs for gaming and then the crypto market, creating a suite of libraries dedicated to hard scaling math? Or the miniaturization and focus on energy efficiency due to phones, making scaled training cost-effective? Finally, the papers released by Google and co, which didn't seem to recognise quite how easy it would be to build on and replicate. Nothing was unlocked apart from a lot of people suddenly noticing how doable all this already was.

marricks
0 replies
23h0m

I mean, you probably didn't pay much attention to battery capacity before phones, laptops, and electric cars, right? Battery capacity has probably increased though at some rate before you paid attention. It's just when something actually becomes relevant that we notice.

Not that more advances don't happen with sustained hype, just that there's some sort of tipping point involving usefulness, based either on improvement of the thing in question or its utility elsewhere.

Der_Einzige
13 replies
1d2h

Controlnet is adapted to video today, the issues are that it's very slow. Haven't you seen the insane quality of videos on civitai?

valine
7 replies
1d2h

I have seen them, the workflows to create those videos are extremely labor intensive. Control net lets you maintain poses between frames, it doesn’t solve the temporal consistency of small details.

mattnewton
6 replies
1d2h

People use animatediff’s motion module (or other models that have cross frame attention layers). Consistency is close to being solved.

dragonwriter
4 replies
1d2h

Temporal consistency is improving, but “close to being solved” is very optimistic.

mattnewton
3 replies
1d2h

No I think we’re actually close. My source is I’m working on this problem and the incredible progress of our tiny 3 person team at drip.art (http://api.drip.art) - we can generate a lot of frames that are consistent, and with interpolation between them, smoothly restyle even long videos. Cross-frame attention works for most cases, it just needs to be scaled up.

And that’s just for diffusion focused approaches like ours. There are probably other techniques from the token flow or nerf family of approaches close to breakout levels of quality, tons of talented researchers working on that too.

ryukoposting
1 replies
16h9m

The demo clips on the site are cool, but when you call it a "solved problem," I'd expect to see panning, rotating, and zooming within a cohesive scene with multiple subjects.

mattnewton
0 replies
4h35m

Thanks for checking it out! We’re certainly not done yet, but much of what you ask is possible or will be soon on the modeling side and we need tools to expose that to a sane workflow in traditional video editors.

Hard_Space
0 replies
15h4m

Once a video can show a person twisting round, and their belt buckle is the same at the end as it was at the start of the turn, it's solved. VFX pipelines need consistency. TC is a long, long way from being solved, except by hitching it to 3DMMs and SMPL models (and even then, the results are not fabulous yet).

valine
0 replies
1d2h

Hopefully this new model will be a step beyond what you can do with animatediff

capableweb
4 replies
1d2h

Haven't you seen the insane quality of videos on civitai?

I have not, so I went to https://civitai.com/ which I guess is what you're talking about? But I cannot find a single video there, just images and models.

capableweb
0 replies
6h20m

Not sure I'd call that "insane quality", more like neat prototypes. I'm excited where things will be in the future, but clearly it has a long way to go.

dragonwriter
0 replies
23h26m

A small percentage of the images are animations. This is (for obvious reasons) particularly common for images used on the catalog pages for animation-related tools and models, but it's also not uncommon for (AnimateDiff-based, mostly) animations to be used to demo the output of other models.

adventured
0 replies
22h47m

https://civitai.com/images

Go there, in the top right of the content area it has two drop-downs: Most Reactions | Filters

Under filters, change the media setting to video.

Civitai has a notoriously poor layout for finding/browsing things unfortunately.

kornesh
0 replies
22h12m

Yeah, solving the flickering problem and achieving temporal consistency will be the key to realize the full potential of generative video models.

Right now, AnimateDiff is leading the way in consistency but I'm really excited to see what people will do with this new model.

hanniabu
0 replies
1d1h

but the real utility of this will be the temporal consistency

The main utility will be misinformation.

ericpauley
24 replies
1d3h

I'm still puzzled as to how these "non-commercial" model licenses are supposed to be enforceable. Software licenses govern the redistribution of the software, not products produced with it. An image isn't GPL'd because it was produced with GIMP.

yorwba
7 replies
1d2h

The license is a contract that allows you to use the software provided you fulfill some conditions. If you do not fulfill the conditions, you have no right to a copy of the software and can be sued. This enforcement mechanism is the same whether the conditions are that you include source code with copies you redistribute, or that you may only use it for evil, or that you must pay a monthly fee. Of course this enforcement mechanism may turn out to be ineffective if it's hard to discover that you're violating the conditions.

comex
6 replies
1d2h

It also somewhat depends on open legal questions like whether models are copyrightable and, if so, whether model outputs are derivative works of the model. Suppose that models are not copyrightable, due to their not being the product of human creativity (this is debatable). Then the creator can still require people to agree to contractual terms before downloading the model from them, presumably including the usage limitations as well as an agreement not to redistribute the model to anyone else who does not also agree. Agreement can happen explicitly by pressing a button, or potentially implicitly just by downloading the model from them, if the terms are clearly disclosed beforehand. But if someone decides on their own (not induced by you in any way) to violate the contract by uploading it somewhere else, and you passively download it from there, then you may be in the clear.

ronsor
5 replies
1d

Then the creator can still require people to agree to contractual terms before downloading the model from them, presumably including the usage limitations as well as an agreement not to redistribute the model to anyone else who does not also agree.

I don't think it's possible to invent copyright-like rights.

yorwba
4 replies
23h22m

Why not? Two willing parties can agree to bind themselves to all kinds of obligations in a contract as long as they're not explicitly illegal.

Copyleft is an example of someone successfully inventing a copyright-like right by bootstrapping off existing copyright with a specially engineered contract.

frognumber
3 replies
19h50m

There are a few problems:

1) You and I invent our own private "copyright" for data (which is not copyrightable)

2) Everything is fine until my wife walks up to my computer and makes a copy of the data. She's not bound by our private "copyright." She doesn't even know it exists, and shares the data with her bestie.

And... our private pseudo-copyright is dead.

Also: Licenses are not the same as contracts. There are times when something can be both, one, or the other. But there are a lot of limits on how far they reach. The output of a program is rarely copyrightable by the author (as opposed to the user).

yorwba
2 replies
12h58m

my wife walks up to my computer and makes a copy of the data

As you agreed to in our contract, you now need to compensate me for the damage caused by your failure to prevent unauthorized third-party access. Of course you're free to attempt to recover the sum you have to pay me from your wife.

The output of a program is rarely copyrightable by the author (as opposed to the user).

The author of the program can make it a condition of letting the user use the program that the user has to assign all copyright to the author of the program, kind of like "By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed." https://www.ycombinator.com/legal/

frognumber
1 replies
7h32m

Okay. Now put yourself in the position of Microsoft, using this scheme for Windows. We'll pretend real copyright doesn't exist, and we've got your harebrained scheme. This is how it plays out:

1) You have a $1T product.

2) My wife leaks it, or a burglar does. I am a typical consumer, with say, a $20k net worth.

You have two choices:

1) Sue me, recover $20k, and be down $1T (minus $20k, plus litigation fees), and get the press of ruining the life of some innocent random person

2) Not sue me. Be down $1T (including the $20k) .

And yes, the author of a program can put whatever conditions they want into the license: "By using this program, you agree to transfer $1M into my bank account in bitcoin, to give me your first-born baby, to swear fealty to me, and to give me your wife in servitude." A court can then read those conditions, have a good laugh, and not enforce them. There are very clear limits on what a court will enforce in licenses (and contracts), and barring exceptional circumstances, courts will not enforce clauses claiming ownership of the output of a program:

https://www.lexology.com/library/detail.aspx?g=eb52567a-2104...

This is why programmers should learn basic law, not treat it as computer code, and consult lawyers when issues come up. Read by a lawyer, a license or contract with an unenforceable clause is as good as having no such clause.

yorwba
0 replies
5h8m

There are very clear limits on what a court will enforce in licenses (and contracts), and owning the output of a program, and barring exceptional circumstance, courts will not enforce them:

It seems to me that the cases in the article you linked involved the author of the program arguing that their copyright automatically extended to the output without any extra contractual provisions concerning copyright assignment, so I don't think they can be used as precedent regarding the enforceability of such clauses.

dist-epoch
4 replies
1d2h

Visual Studio Community (and many other products) only allows "non-commercial" usage. Sounds like it limits what you can do with what you produce with it.

At the end of the day, a license is a legal contract. If you agree that an image which you produce with some software will be GPL'ed, it's enforceable.

As an example, see the Creative Commons license, ShareAlike clause:

If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

antonyt
2 replies
23h46m

Do you have link for the VS Community terms you're describing? What I've found is directly contradictory: "Any individual developer can use Visual Studio Community to create their own free or paid apps." From https://visualstudio.microsoft.com/vs/community/

dist-epoch
1 replies
23h36m

Enterprise organizations are not allowed to use VS Community for commercial purposes:

In enterprise organizations (meaning those with >250 PCs or >$1 Million US Dollars in annual revenue), no use is permitted beyond the open source, academic research, and classroom learning environment scenarios described above.

antonyt
0 replies
5h29m

I see, thanks!

blibble
0 replies
1d2h

At the end of the day, a license is a legal contract. If you agree that an image which you produce with some software will be GPL'ed, it's enforceable.

you can put whatever you want in a contract, doesn't mean it's enforceable

kmeisthax
2 replies
1d1h

So, there's a few different things interacting here that are a little confusing.

First off, you have copyright law, which grants monopolies on the act of copying to the creators of the original. In order to legally make use of that work you need to either have permission to do so (a license), or you need to own a copy of the work that was made by someone with permission to make and sell copies (a sale). For the purposes of computer software, you will almost always get rights to the software through a license and not a sale. In fact, there is an argument that usage of computer software requires a license and that a sale wouldn't be enough because you wouldn't have permission to load it into RAM[0].

Licenses are, at least under US law, contracts. These are Turing-complete priestly rites written in a special register of English that legally bind people to do or not do certain things. A license can grant rights, or, confusingly, take them away. For example, you could write a license that takes away your fair use rights[1], and courts will actually respect that. So you can also have a license that says you're only allowed to use software for specific listed purposes but not others.

In copyright you also have the notion of a derivative work. This was invented whole-cloth by the US Supreme Court, who needed a reason to prosecute someone for making a SSSniperWolf-tier abridgement[2] of someone else's George Washington biography. Normal copyright infringement is evidenced by substantial similarity and access: i.e. you saw the original, then you made something that's nearly identical, ergo infringement. The law regarding derivative works goes a step further and counts hypothetical works that an author might make - like sequels, translations, remakes, abridgements, and so on - as requiring permission in order to make. Without that permission, you don't own anything and your work has no right to exist.

The GPL is the anticopyright "judo move", invented by a really ornery computer programmer that was angry about not being able to fix their printer drivers. It disclaims almost the entire copyright monopoly, but it leaves behind one license restriction, called a "copyleft": any derivative work must be licensed under the GPL. So if you modify the software and distribute it, you have to distribute your changes under GPL terms, thus locking the software in the commons.

Images made with software are not derivative works of the software, nor do they contain a substantially similar copy of the software in them. Ergo, the GPL copyleft does not trip. In fact, even if it did trip, your image is still not a derivative work of the software, so you don't lose ownership over the image because you didn't get permission. This also applies to model licenses on AI software, inasmuch as the AI companies don't own their training data[3].

However, there's still something that licenses can take away: your right to use the software. If you use the model for "commercial" purposes - whatever those would be - you'd be in breach of the license. What happens next is also determined by the license. It could be written to take away your noncommercial rights if you breach the license, or it could preserve them. In either case, however, the primary enforcement mechanism would be a court of law, and courts usually award money damages. If particularly justified, they could demand you destroy all copies of the software.

If it went to SCOTUS (unlikely), they might even decide that images made by software are derivative works of the software after all, just to spite you. The Betamax case said that advertising a copying device with potentially infringing scenarios was fine as long as that device could be used in a non-infringing manner, but then the Grokster case said it was "inducement" and overturned it. Static, unchanging rules are ultimately a polite fiction, and the law can change behind your back if the people in power want or need it to. This is why you don't talk about the law in terms of something being legal or illegal, you talk about it in terms of risk.

[0] Yes, this is a real argument that courts have actually made. Or at least the Ninth Circuit.

The actual facts of the case are even more insane - basically a company trying to sue former employees for fixing its customers' computers. Imagine if Apple sued Louis Rossmann for pirating macOS every time he turned on a customer laptop. The only reason why they can't is because Congress actually created a special exemption for computer repair and made it part of the DMCA.

[1] For example, one of the things you agree to when you buy Oracle database software is to give up your right to benchmark the software. I'm serious! The tech industry is evil and needs to burn down to the ground!

[2] They took 300 pages worth of material from 12 books and copied it into a separate, 2 volume work.

[3] Whether or not copyright on the training data images flows through to make generated images a derivative work is a separate legal question in active litigation.

rperez333
0 replies
20h46m

If a company trains the model from scratch, on its own dataset, could the resulting model be used commercially?

dragonwriter
0 replies
1d1h

Licenses are, at least under US law, contracts

Not necessarily; gratuitous licenses are not contracts. Licenses which happen to also meet the requirements for contracts (or be embedded in agreements that do) are contracts or components of contracts, but that's not all licenses.

cubefox
2 replies
1d3h

Nobody claimed otherwise?

not2b
0 replies
1d2h

There are sites that make Stable Diffusion-derived models available, along with GPU resources, and they sell the service of generating images from the models. The company isn't permitting that use, and it seems that they could find violators and shut them down.

littlethoughts
0 replies
1d2h

Fantasy.ai was subject to controversy for attempting to license models.

SXX
2 replies
1d

It doesn't have to be enforceable. This licensing model works exactly the same as Microsoft Windows licensing or WinRAR licensing. Lots and lots of people have pirated Windows or just buy some cheap keys off eBay, but none of them in their right mind would use anything like that at their company.

In the same way, you can easily violate any "non-commercial" clauses of models like this one as a private person or as some tiny startup, but a company that decides to use them for its business will more likely just go and pay.

So it's possible to ignore license, but legal and financial risks are not worth it for businesses.

taberiand
1 replies
22h14m

I've heard companies also intentionally do not go after individuals pirating software e.g., Adobe Photoshop - it benefits them to have students pirate and skill up on their software and then enter companies that buy Photoshop because their employees know it, over locking down and having those students, and then the businesses, switch to open source.

Duanemclemore
0 replies
2h56m

I'm sure there are plenty of other examples, but in my personal experience this was Autodesk's strategy with AutoCAD. Get market saturation by being extremely light on piracy. Then, once you're the only one standing lower the boom. I remember, it was almost like flipping a switch on a single DAY in the mid-00's when they went from totally lax on unpaid users to suing the bejeezus out of anyone who they had good enough documentation on.

One smart thing they did was they'd check the online job listings and if a firm advertised for needing AutoCAD experience they'd check their licenses. I knew firms who got calls from Autodesk legal the DAY AFTER posting an opening.

stevage
0 replies
21h5m

A software licence can definitely govern who can use it and what they can do with it.

An image isn't GPL'd because it was produced with GIMP.

That's because of how the GPL is written, not because of some limitation of software licences.

Der_Einzige
0 replies
1d2h

They're not enforceable.

spaceman_2020
17 replies
1d2h

A seemingly off topic question, but with enough compute and optimization, could you eventually simulate “reality”?

Like, at this point, what are the technical counters to the assertion that our world is a simulation?

KineticLensman
6 replies
1d2h

(disclaimer: worked in the sim industry for 25 years, still active in terms of physics-based rendering).

First off, there are zero technical proofs that we are in a sim, just a number of philosophical arguments.

In practical terms, we cannot yet simulate a single human cell at the molecular level, given the massive number of interactions that occur every microsecond. Simulating our entire universe is not technically possible within the lifetime of our universe, according to our current understanding of computation and physics. You either have to assume that ‘the sim’ is very narrowly focussed in scope and fidelity, and / or that the outer universe that hosts ‘the sim’ has laws of physics that are essentially magic from our perspective. In which case the simulation hypothesis is essentially a religious argument, where the creator typed 'let there be light' into his computer. If there isn't such a creator, the sim hypothesis 'merely' suggests that our universe, at its lowest levels, looks somewhat computational, which is an entirely different argument.

freedomben
4 replies
1d2h

I don't think you would need to simulate the entire universe, just enough of it that the consciousness receiving sense data can't encounter any missing info or "glitches" in the metaphorical matrix. Still hard of course, but substantially less compute intensive than every molecule in the universe.

kaashif
1 replies
1d

And you don't have to simulate it in real time, maybe 1 second here takes years or centuries to simulate outside the simulation. It's not like we'd have any way to tell.

hackerlight
0 replies
1d

These are all open questions in philosophy of mind. Nobody knows what causes consciousness/qualia so nobody knows if it's substrate dependent or not and therefore nobody knows if it can be simulated in a computer, or if it can nobody knows what type of computer is required for consciousness to be a property of the resulting simulation.

gcanyon
0 replies
1d1h

And if you’re in charge of the simulation, you get to decide how many “consciousnesses” there are, constraining them to be within your available compute. Maybe that’s ~8 billion — maybe it’s 1. Yeah, I’m feeling pretty Boltzmann-ish right now…

KineticLensman
0 replies
1d1h

but substantially less compute intensive than every molecule in the universe

Very true, but to me this view of the universe and one's existence within it as a sort of second-rate solipsist bodge isn't a satisfyingly profound answer to the question of life the universe and everything.

Although put like that it explains quite a lot.

[Edit] There is also a sense in which the sim-as-a-focussed-mini-universe view is even less falsifiable, because sim proponents address any doubt about the sim by moving the goal posts to accommodate what they claim is actually achievable by the putative creator/hacker on Planet Tharg or similar.

jdaxe
0 replies
21h34m

Maybe something like quantum mechanics is an "optimization" of the sim, i.e. the sim doesn't actually compute the locations, spin, etc. of subatomic particles but instead just uses probabilities to simulate them. Only when a consciousness decides to look more closely does it retroactively decide what those properties really were.

Kind of like how video games won't render the full resolution textures when the character is far away or zoomed out.

I'm sure I'm not the first person to have thought this.

tracerbulletx
2 replies
1d2h

The brain does simulate reality in the sense that what you experience isn't direct sensory input, but more like a dream being generated to predict what it thinks is happening based on conflicting and imperfect sensory input.

danielbln
0 replies
1d2h

Take vision, for example: it comes in from the optic nerve warped and upside down, as small patches of high resolution captured by the eyes zigzagging across the visual field (saccades), all of which is assembled and integrated into a coherent field of vision by our trusty old grey blob.

accrual
0 replies
1d2h

To illustrate your point, an easily accessible example of this is how the second hand on clocks appears to freeze for longer than a second when you quickly glance at it. The brain is predicting/interpolating what it expects to see, creating the illusion of a delay.

https://www.popsci.com/how-time-seems-to-stop/

2-718-281-828
1 replies
1d2h

Like, at this point, what are the technical counters to the assertion that our world is a simulation?

How about this: the theory is neither verifiable nor falsifiable.

vidarh
0 replies
1d2h

The general concept is not falsifiable, but many variations might be, or their inverse might be. E.g. the theory that we are not in a simulation would in general be falsifiable by finding an "escape" from a simulation and so showing we are in one (but not finding an escape of course tells us nothing).

It's not a very useful endeavour to worry about, but it can be fun to speculate about what might give rise to testable hypotheses and what that might tell us about the world.

sesm
0 replies
9h22m

There can be no technical counters to the assertion that our world is a simulation. If our world is a simulation, then the hardware/software that simulates it is outside our world and its technical constitution is inaccessible to us.

It's purely a religious question. When humanity invented the wheel, religion described the world as a giant wheel rotating in cycles. When humanity invented books, religion described the world as a book, and God as its writer. When humanity invented complex mechanisms, religion described the world as a giant mechanism and God as a watchmaker. Then computers were invented, and you can guess what happened next.

refulgentis
0 replies
1d2h

A little too "freshman's first hit off a bong" for me. There are, of course, substantial differences between video and reality.

Let's steel-man it: you mean 3D VR. Let's stipulate there's a headset today that renders 3D visuals indistinguishable from reality. We're still short the other four senses.

Much like faith, there's always a way to sort of escape the traps here and say "can you PROVE this is base reality?"

The general technical argument against "brain in a vat being stimulated" would be the computational expense of doing so, but you can also write that off with the equivalent of foveated rendering, applied to all senses / entities.

justanotherjoe
0 replies
21h2m

That theory was never meant to be so airtight such that it 'needs' to be refuted.

beepbooptheory
0 replies
1d2h

Why does it matter? Not trying to dismiss, but truly, what would it mean to you if you could somehow verify the "simulation"?

If it would mean something drastic to you, I would be very curious to hear your preexisting existential beliefs/commitments.

People say this sometimes, and it's slowly been revealed to me that it's just a new kind of geocentrism: it's not just a simulation people have in mind, but one where earth/humans are centered, and the rest of the universe is just for the benefit of "our" part of the simulation.

Which is a fine theory I guess, but it's also just essentially wanting God to exist, with extra steps!

SXX
0 replies
1d

Actually it was already done by sentdex with GAN Theft Auto:

https://youtu.be/udPY5rQVoW0

To an extent...

PS: Video is 2 years old, but still really impressive.

helpmenotok
16 replies
1d3h

Can this be used for porn?

citrusui
7 replies
1d3h

Very unusual comment.

I do not think so as the chance of constructing a fleshy eldritch horror is quite high.

johndevor
3 replies
1d2h

How is that not the first question to ask? Porn has proven to be a fantastic litmus test of fast market penetration when it comes to new technologies.

xanderlewis
0 replies
1d2h

Market what?

throwaway743
0 replies
1d2h

No pun intended?

citrusui
0 replies
1d2h

This is true. I was hoping my educated guess of the outcome would minimize the possibility of anyone attempting this. And yet, here we are - the only losing strategy in the technology sector is to not try at all.

tstrimple
0 replies
1d3h

I do not think so as the chance of constructing a fleshy eldritch horror is quite high.

There is a market for everything!

crtasm
0 replies
1d2h

That didn't stop people using PornPen for images and it wouldn't stop them using something else for video.

ben_w
0 replies
1d2h

A surprisingly large number of people are into fleshy eldritch horrors.

hbn
1 replies
1d2h

Depends on whether trains, cars, and/or black cowboys tickle your fancy.

boppo1
0 replies
1d1h
theodric
0 replies
1d3h

If it can't, someone will massage it until it can. Porn, and probably also stock video to sell to YouTubers.

artursapek
0 replies
1d2h

Porn will be one of the main use cases for this technology. Porn sites pioneered video streaming technologies back in the day, and drove a lot of the innovation there.

alkonaut
0 replies
13h40m

The answer to that question is always "yes", regardless what "this" is.

Diffusion models for moving images are already used to a limited extent for this. And I'm sure it will be the use case, not just an edge case.

SXX
0 replies
1d

It's already posted to Unstable Diffusion discord so soon we'll know.

After all fine-tuning wouldn't take that long.

Racing0461
0 replies
1d2h

Nope, all commercial models are severely gated.

1024core
0 replies
1d2h

The question reminded me of this classic: https://www.youtube.com/watch?v=YRgNOyCnbqg

richthekid
15 replies
1d2h

This is gonna change everything

Chabsff
13 replies
1d2h

It's really not.

Don't get me wrong, this is insanely cool, but it's still a long way from good enough to be truly disruptive.

echelon
7 replies
1d2h

One year.

All of Hollywood falls.

Chabsff
4 replies
1d2h

No offense, but this is absolutely delusional.

As long as people can "clock" content generated from these models, it will be treated by consumers as low-effort drivel, no matter how much actual artistic effort goes into the exercise. Only once these systems push through the threshold of being indistinguishable from artistry will all hell break loose, and we are still very far from that.

Paint-by-numbers low-effort market-driven stuff will take a hit for sure, but that's only a portion of the market, and frankly not one I'm going to be missing.

ben_w
3 replies
1d2h

Very far, yes, but also in a fast moving field.

CGI in films used to be obvious all the time no matter how good the artists using it, now it's everywhere and only noticeable when that's the point; the gap from Tron to Fellowship of the Ring was 19.5 years.

My guess is the analogy here puts the quality of existing genAI somewhere near the equivalent of early TV CGI, given its use in one of the Marvel title sequences etc., but it is just an analogy and there's no guarantees of anything either way.

r3d0c
2 replies
1d2h

"Something unrelated improved over time, so something else unrelated will also improve to whatever goal you've set in your mind."

Weird logic circles y'all keep making to justify your beliefs. I mean, the world is very easy, like you just described, if you completely strip out all nuance and complexity.

People used to believe at the start of the space race that we'd have Mars colonies by now, because they looked at the rate of technological advancement from 1910 to 1970, from the first flight to landing on the moon; yet that didn't happen, because everything doesn't follow the same repeatable patterns.

pessimizer
0 replies
1d1h

People also believed that recorded music would destroy the player piano industry and the market for piano rolls. Just because recorded music is cheaper doesn't mean that the audience will be willing to give up the actual sound of a piano being played.

ben_w
0 replies
1d1h

First, lotta artists already upset with genAI and the impact it has.

Second, I literally wrote the same point you seem to think is a gotcha:

it is just an analogy and there's no guarantees of anything either way

woeirua
1 replies
1d2h

Every time something like this is released someone comments how it’s going to blow up legacy studios. The only way you can possibly think that is that: 1-the studios themselves will somehow be prevented from using this tech themselves, and 2-that somehow customers will suddenly become amenable to low grade garbage movies. Hollywood already produces thousands of low grade B or C movies every year that cost fractions of what it costs to make a blockbuster. Those movies make almost nothing at the box office.

If anything, a deluge of cheap AI generated movies is going to lead to a flight to quality. The big studios will be more powerful because they will reap the productivity gains and use traditional techniques to smooth out the rough edges.

underscoring
0 replies
1d

2-that somehow customers will suddenly become amenable to low grade garbage movies

People have been amenable to low grade garbage movies for a long, long time. See Adam Sandler's back catalog.

evrenesat
4 replies
1d2h

In a few years' time, teenagers will be consuming shows and films made by their peers, not by streaming providers. They'll forgive and perhaps even appreciate the technical imperfections for the sake of uncensored, original content that fits perfectly with their cultural identity.

Actually, when processing power catches up, I'm expecting a movie engine with well-defined characters, scenes, entities, etc., so people will be able to share mostly text-based scenarios to watch on their hardware players.

nwienert
1 replies
1d

They do that now (I forget the name; there's a popular one my niece uses to make animated comics, and others do similar things in Minecraft etc.), and they have been doing it since forever - nearly 30 years ago my friends and I were scribbling comic panels into our notebooks and sharing them around class.

znkynz
0 replies
23h19m

ms comic chat for the win

Chabsff
1 replies
1d2h

Similar to how all the kids today only play itch.io games thanks to Unity and Unreal dramatically lowering the bar of entry into game development.

Oh wait... No.

All it has done is create an environment where indie games are now assumed to be trash unless proven otherwise, making getting traction as a small developer orders of magnitude harder than it has ever been, because their efforts are drowning in a sea of mediocrity.

That same thing is already starting to happen on YouTube with AI content, and there's no reason for me to expect this to go any other way.

evrenesat
0 replies
1d1h

It took ~2 years for my 10 year old daughter to get bored and give up the shitty user made roblox games and start playing on switch, steam or ps4.

jetsetk
0 replies
1d2h

Is it? How so?

christkv
15 replies
1d3h

Looks like I'm still good for my bet with some friends that before 2028 a team of 5-10 people will create, on a shoestring budget, a blockbuster-style movie that today costs 100+ million USD, and we won't be able to tell the difference.

ben_w
5 replies
1d2h

I wouldn't bet either way.

Back in the mid 90s to 2010 or so, graphical improvements were hailed as photorealistic only to be improved upon with each subsequent blockbuster game.

I think we're in a similar phase with AI[0]: every new release in $category is better, gets hailed as super fantastic world changing, is improved upon in the subsequent Two Minute Papers video on $category, and the cycle repeats.

[0] all of them: LLMs, image generators, cars, robots, voice recognition and synthesis, scientific research, …

Sohcahtoa82
2 replies
23h21m

Back in the mid 90s to 2010 or so, graphical improvements were hailed as photorealistic

Whenever I saw anybody calling those graphics "photorealistic", I always had to roll my eyes and question if those people were legally blind.

Like, c'mon. Yeah, they could be large leaps ahead of the previous generation, but photorealistic? Get real.

Even today, I'm not sure there's a single game that I would say has photo-realistic graphics.

ben_w
1 replies
11h35m

Even today, I'm not sure there's a single game that I would say has photo-realistic graphics.

Looking just at the videos (because I don't have time to play the latest games any more and even if I did it's unreleased), I think that "Unrecord" is also something I can't distinguish from a filmed cinematic experience[0]: https://store.steampowered.com/app/2381520/Unrecord/

Though there are still caveats even there, as the pixelated faces are almost certainly necessary given the state of the art; and because cinematic experiences are themselves fake, I can't tell if the guns are "really-real" or "Hollywood".

Buuuuut… I thought much the same about Myst back in the day, and even the bits that stayed impressive for years (the fancy bedroom in the Stoneship age), don't stand out any more. Riven was better, but even that's not really realistic now. I think I did manage to fool my GCSE art teacher at the time with a printed screenshot from Riven, but that might just have been because printers were bad at everything.

Sohcahtoa82
0 replies
4h44m

Unrecord looks amazing, I forgot about that one.

IMO, though, the lighting in the indoor scenes is just not quite right. There's something uncanny valley about it to me. When the flashlight shines, it's clearly still a computer render to my eyes.

The outdoor shots, though, definitely look flawless.

Keyframe
1 replies
1d1h

Your comment reminded me of this: https://www.reddit.com/r/gaming/comments/ktyr1/unreal_yes_th...

Many more examples, of course.

ben_w
0 replies
1d

Yup, that castle flyby, those reflections. I remember being mesmerised by the sequence as a teenager.

Big quality improvement over Marathon 2 on a mid-90s Mac, which itself was a substantial boost over the Commodore 64 and NES I'd been playing on before that.

accrual
2 replies
1d2h

The first full-length AI generated movie will be an important milestone for sure, and will probably become a "required watch" for future AI history classes. I wonder what the Rotten Tomatoes page will look like.

qiine
0 replies
1d2h

"I wonder what the Rotten Tomatoes page will look like"

Surely it will be written using machine vision and LLMs!

jjkaczor
0 replies
1d2h

As per the reviews - it will be hard to say, as both positive and negative takes will be uploaded by ChatGPT bots (or its myriad descendants).

deckard1
1 replies
1d2h

I'm imagining more of an AI that takes a standard movie screenplay and a sidecar file, similar to a CSS file for the web and generates the movie. This sidecar file would contain the "director" of the movie, with camera angles, shot length and speed, color grading, etc. Don't like how the new Dune movie looks? Edit the stylesheet and make it your own. Personalized remixed blockbusters.

On a more serious note, I don't think Roger Deakins has anything to worry about right now. Or maybe ever. We've been here before. DAWs opened up an entire world of audio production to people that could afford a laptop and some basic gear. But we certainly do not have a thousand Beatles out there. It still requires talent and effort.

timeon
0 replies
1d1h

thousand Beatles out there. It still requires talent and effort

As well as marketing.

throwaway743
0 replies
1d2h

Definitely a big first for benchmarks. After that hyper personalized content/media generated on-demand

marcusverus
0 replies
1d2h

I'm pumped for this future, but I'm not sure that I buy your optimistic timeline. If the history of AI has taught us anything, it is that the last 1% of progress is the hardest half. And given the unforgiving nature of the uncanny valley, the video produced by such a system will be worthless until it is damn-near perfect. That's a tall order!

henriquecm8
0 replies
9h10m

What I am really looking forward to is some Star Trek style holodeck, but I guess we will start with it in VR headsets first.

Geordi: "Computer, in the Holmesian style, create a mystery to confound Data with an opponent who has the ability to defeat him"

CamperBob2
0 replies
1d3h

It'll happen, but I think you're early. 2038 for sure, unless something drastic happens to stop it (or is forced to happen.)

speedgoose
6 replies
1d2h

Has anyone managed to run the thing? I got the Streamlit demo to start after fighting with PyTorch, mamba, and pip for half an hour, but the demo runs out of GPU memory after a little while. I have 24GB of GPU memory on the machine I used; does it need more?

mkaic
2 replies
1d1h

Have heard from others attempting it that it needs 40GB, so basically an A100/A6000/H100 or other large card. Or an Apple Silicon Mac with a bunch of unified memory, I guess.

speedgoose
0 replies
1d1h

Alright thanks for the information. I will try to justify using one A100 for my "very important" research activities.

mlboss
0 replies
1d1h

Give it a week.

skonteam
1 replies
1d1h

Yeah, I have a 24GB 4090; try reducing the number of frames decoded to something like 4 or 8. Although, keep in mind it caps out the 24GB and spills over to system RAM (with the latest NVIDIA drivers).
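
For anyone wondering why that helps, here is a minimal sketch (an illustration, not the actual SVD code) of decoding the latent frames in chunks, so the VAE decoder only ever holds a few frames' worth of activations at a time:

    import torch

    def decode_in_chunks(vae_decode, latents, chunk=4):
        # latents: (num_frames, C, H, W) in latent space; vae_decode is any
        # function that maps a batch of latents to a batch of image frames.
        frames = []
        for i in range(0, latents.shape[0], chunk):
            with torch.no_grad():
                # Peak memory scales with `chunk`, not with the total frame count.
                frames.append(vae_decode(latents[i:i + chunk]))
        return torch.cat(frames, dim=0)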

speedgoose
0 replies
1d1h

Oh yes it works, thanks!

nwoli
0 replies
1d

Is the checkpoint default fp16 or fp32?

firefoxd
5 replies
21h11m

I understand the magnitude of the innovation that's going on here. But it still feels like we are generating these videos with both hands tied behind our backs. In other words, it's nearly impossible to edit the videos under these constraints. (Imagine trying to edit the blue jays to get the perfect view.)

Since videos are rarely consumed raw, what if this becomes a pipeline in Blender instead (Blender, the 3D software)? Then the video becomes a complete scene with all the key elements of the text input animated. You have your textures, you have your animation, you have your camera, you have all the objects in place. We could even have the render engine in the pipeline to increase the speed of video generation.

It may sound like I'm complaining, but I'm just making a feature request...

huytersd
3 replies
20h57m

What would solve all these issues is full generation of 3D models that we hopefully get a chance to see over the next decade. I’ve been advocating for a solid LiDAR camera on the iPhone so there is a lot of training data for these LLMs.

ricardobeat
2 replies
20h17m

I’ve been advocating for a solid LiDAR camera on the iPhone

What do you mean by “advocating”? The iPhone has had a LiDAR camera since 2020.

xvector
1 replies
20h3m

That's probably why they qualified with "solid", the iPhone's LiDAR camera is quite terrible.

huytersd
0 replies
18h26m

Yes, exactly.

jwoodbridge
0 replies
17h59m

we're working on this - dream3d.com

neaumusic
3 replies
1d

It's funny that we still don't really have video wallpapers on most devices (I'm only aware of Wallpaper Engine on Windows)

Sohcahtoa82
1 replies
23h28m

I had a video wallpaper on my Motorola Droid back in 2010.

tetris11
0 replies
14h27m

and a battery life of...?

I do wonder if there have been any codec studies that measure power usage with respect to RAM

spupy
0 replies
13h47m

Mplayer/MPV used to be able to play videos in the X root window like a wallpaper. No idea if it still works nowadays.

minimaxir
3 replies
1d3h

Model weights (two variations, each 10GB) are available without waitlist/approval: https://huggingface.co/stabilityai/stable-video-diffusion-im...

The LICENSE is a special non-commercial one: https://huggingface.co/stabilityai/stable-video-diffusion-im...

It's unclear how exactly to run it easily: diffusers has video generation support now, but it remains to be seen whether it plugs in seamlessly.
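
If the diffusers integration does land cleanly, usage will presumably look like the existing image pipelines. A rough sketch, assuming a diffusers build that ships a StableVideoDiffusionPipeline for these weights (exact names and arguments may differ):

    # Rough sketch only; assumes diffusers exposes StableVideoDiffusionPipeline
    # for the stabilityai/stable-video-diffusion-img2vid-xt weights.
    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import load_image, export_to_video

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        torch_dtype=torch.float16,
        variant="fp16",
    )
    pipe.enable_model_cpu_offload()  # offload idle submodules to CPU to save VRAM

    image = load_image("input.png").resize((1024, 576))
    # decode_chunk_size limits how many frames the VAE decodes at once
    frames = pipe(image, decode_chunk_size=4).frames[0]
    export_to_video(frames, "output.mp4", fps=7)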

chankstein38
1 replies
1d3h

It looks like the huggingface page links their github that seems to have python scripts to run these: https://github.com/Stability-AI/generative-models

minimaxir
0 replies
1d3h

Those scripts aren't as easy to use or iterate upon since they are CLI apps instead of a REPL like a Colab/Jupyter Notebook (although these models probably will not run in a normal Colab without shenanigans).

They can be hacked into a Jupyter Notebook but it's really not fun.

ronsor
0 replies
1d3h

Regular reminder that it is very likely that model weights can't be copyrighted (and thus can't be licensed).

torginus
2 replies
1d2h

I admit I'm ignorant about these models' inner workings, but I don't understand why text is the chosen input format for these models.

It was the same for image generation, where one needed to produce text prompts to create the image, until stuff like img2img and ControlNet came along that allowed things like controlling poses and inpainting, or having multiple prompts with masks controlling which part of the image is influenced by which prompt.

pizzafeelsright
0 replies
1d1h

Imago Dei? The Word is what is spoken when we create.

The input eventually becomes meanings mapped to reality.

gorbypark
0 replies
1d2h

According to the GitHub repo this is an "image-to-video model". They tease of an upcoming "text to video" interface on the linked landing page, though. My guess is that interface will use a text-to-image model and then feed that into the image-to-video model.

rbhuta
2 replies
19h52m

VRAM requirements are big for this launch. We're hosting this for free at https://app.decoherence.co/stablevideo. Disclaimer: Google log-in required to help us reduce spam.

xena
1 replies
18h58m

How big is big?

whywhywhywhy
0 replies
3h21m

40GB, although I'm hearing reports that a 3090 can manage low frame counts.

keiferski
2 replies
16h43m

Question for anyone more familiar with this space: are there any high-quality tools which take an image and make it into a short video? For example, an image of a tree becomes a video of a tree swaying in the wind.

I have googled for it but mostly just get low quality web tools.

circuit10
1 replies
11h26m

That's what this is

keiferski
0 replies
10h27m

Hmm, for some reason I was understanding this as a text-to-video model. I’ll have to read this again.

gregorymichael
2 replies
20h26m

How long until Replicate has this available?

rbhuta
0 replies
19h46m

We're hosting this free (no credit card needed) at https://app.decoherence.co/stablevideo Disclaimer: Google log-in required to help us reduce spam.

Let me know what you think of it! It works best on landscape images from my tests.

radicality
0 replies
17h42m

Looks like there is a WIP here: https://replicate.com/lucataco/svd

awongh
2 replies
1d2h

It makes sense that they had to take out all of the cuts and fades from the training data to improve results.

In the background section of the research paper they mention “temporal convolution layers”; can anyone explain what that is? What sort of training data is the input to represent temporal states between the images that make up a video? Or does that mean something else?

machinekob
0 replies
1d1h

I would assume it's something similar to joining multiple frames/attentions in the channel dimension and then shifting values around so the convolution has access to some channels from other video frames.

I was working on a similar idea a few years ago, using this paper as a reference, and it worked extremely well for consistency, also helping with flicker: https://arxiv.org/abs/1811.08383

flaghacker
0 replies
23h27m

It means that instead of (only) doing convolution in spatial dimensions, it also(/instead) happens in the temporal dimension.

A good resource for the "instead" case: https://unit8.com/resources/temporal-convolutional-networks-...

The "also" case is an example of 3D convolution, an example of a paper that uses it: https://www.cv-foundation.org/openaccess/content_iccv_2015/p...

youssefabdelm
1 replies
1d3h

Can't wait for these things to not suck

accrual
0 replies
1d2h

It's definitely pretty impressive already. If there could be some kind of "final pass" to remove the slightly glitchy generative artifacts, these look completely passable for simple .gif/.webm header images. Especially if they could be made to loop smoothly à la Snapchat's bounce filter.

renlo
1 replies
14h23m

How much longer will it be until we can play "video games" which consist of user-input streamed to an AI that generates video output and streams it to the player's screen?

slow_numbnut
0 replies
11h59m

If you're willing to accept text-based output, then text adventure style games and even simulating bash were possible using ChatGPT until OpenAI nerfed it.

nuclearsugar
1 replies
22h48m

Very excited to play with this. Some of my latest experiments - https://www.jasonfletcher.info/vjloops/

rbhuta
0 replies
19h45m

We're hosting this free (no credit card needed) at https://app.decoherence.co/stablevideo Disclaimer: Google log-in required to help us reduce spam. Let me know what you think of it! It works best on landscape images from my tests.

iamgopal
1 replies
8h14m

Very soon, we will be able to change the storyline of a web series dynamically: a little more thrill, a little more comedy, changing a character's face to match ours or someone else's, all in 3D with a 360-degree view. How far are we from this? 5 years?

niek_pas
0 replies
7h44m

At least several decades, I’d say. This is a hugely complex, multifaceted problem. LLMs can’t even write half-decent screenplays yet.

dinvlad
1 replies
1d3h

Seems relatively unimpressive tbh - it's not really a video, and we've seen this kind of thing for a few months now

accrual
0 replies
1d2h

It seems like the breakthrough is that the video generating method is now baked into the model and generator. I've seen several fairly impressive AI animations as well, but until now, I assumed they were tediously cobbled together by hacking on the still-image SD models.

accrual
1 replies
1d3h

Fascinating leap forward.

It makes me think of the difference between ancestral and non-ancestral samplers, e.g. Euler vs Euler Ancestral. With Euler, the output is somewhat deterministic and doesn't vary with increasing sampling steps, but with Ancestral, noise is added to each step which creates more variety but is more random/stochastic.

I assume to create video, the sampler needs to lean heavily on the previous frame while injecting some kind of sub-prompt, like rotate <object> to the left by 5 degrees, etc. I like the phrase another commenter used, "temporal consistency".

Edit: Indeed the special sauce is "temporal layers". [0]

Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets

[0] https://stability.ai/research/stable-video-diffusion-scaling...
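
On the sampler point, a toy sketch of the difference (roughly the k-diffusion formulation, not the exact samplers shipped with SVD): the ancestral variant injects fresh noise at every step, which is where the extra variety comes from.

    import torch

    def euler_step(x, denoised, sigma, sigma_next):
        d = (x - denoised) / sigma            # derivative estimate
        return x + d * (sigma_next - sigma)   # deterministic update

    def euler_ancestral_step(x, denoised, sigma, sigma_next):
        # Split the step into a deterministic part plus fresh noise.
        sigma_up = min(sigma_next,
                       (sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2) ** 0.5)
        sigma_down = (sigma_next**2 - sigma_up**2) ** 0.5
        d = (x - denoised) / sigma
        x = x + d * (sigma_down - sigma)
        return x + torch.randn_like(x) * sigma_up   # stochastic: new noise each step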

adventured
0 replies
1d2h

The hardest problem the Stable Diffusion community has dealt with in terms of quality has been in the video space, largely in relation to the consistency between frames. It's probably the most commonly discussed problem for example on r/stablediffusion. Temporal consistency is the popular term for that.

So this example was posted an hour ago, and it's jumping all over the place frame to frame (somewhat weak temporal consistency). The author appears to have used pretty straight-forward text2img + Animatediff:

https://www.reddit.com/r/StableDiffusion/comments/180no09/on...

Fixing that frame to frame jitter related to animation is probably the most in-demand thing around Stable Diffusion right now.

Animatediff motion painting made a splash the other day:

https://www.reddit.com/r/StableDiffusion/comments/17xnqn7/ro...

It's definitely an exciting time around SD + animation. You can see how close it is to reaching the next level of generation.

AltruisticGapHN
1 replies
11h40m

These are basically like animated postcards, like you often see now on loading screens in video games. A single picture has been animated. Still a long way from actual video.

siddbudd
0 replies
10h45m

"2 more papers down the line"...

shaileshm
0 replies
22h54m

This field moves so fast. Blink an eye and there is another new paper. This is really cool, and the learning speed of us humans is insane! Really excited about using it for downstream tasks! I wonder how easy it is to integrate AnimateDiff with this model?

Also, can someone benchmark it on M3 devices? It would be cool to see if it's worth getting one to run these diffusion inferences and development on. If the M3 Pro allows finetuning, it would be amazing to use it on downstream tasks!

rvion
0 replies
10h13m

Finally ! Now that this is out, I can finally start adding proper video widgets to CushyStudio https://github.com/rvion/CushyStudio#readme . Really hope I can get in touch with StabilityAi people soon. Maybe Hacker News will help

pcj-github
0 replies
1d

Soon the hollywood strike won't even matter, won't need any of those jobs. Entire west coast economy obliterated.

nbzso
0 replies
1d2h

Model chain:

Instance One : Act as a top tier Hollywood scenarist, use the public available data for emotional sentiment to generate a storyline, apply the well known archetypes from proven blockbusters for character development. Move to instance two.

Instance Two: Act as top tier producer. {insert generated prompt}. Move to instance three.

Instance Three: Generate Meta-humans and load personality traits. Move to instance four.

Instance Four: Act as a top tier director.{insert generated prompt}. Move to instance five.

Instance Five: Act as a top tier editor.{insert generated prompt}. Move to instance six.

Instance Six: Act as a top tier marketing and advertisement agency.{insert generated prompt}. Move to instance seven.

Instance Seven: Act as a top tier accountant, generate an interface to real-time ROI data and give me the results on an optimized timeline into my AI induced dream.

Personal GPT: Buy some stocks, diversify my portfolio, stock up on synthetic meat, bug-coke and Soma. Call my mom and tell her I made it.

jonplackett
0 replies
23h38m

Is this available in the stability API any time soon?

epiccoleman
0 replies
1d2h

This is really, really cool. A few months ago I was playing with some of the "video" generation models on Replicate, and I got some really neat results[1], but it was very clear that the resulting videos were made from prompting each "frame" with the previous one. This looks like it can actually figure out how to make something that has a higher level context to it.

It's crazy to see this level of progress in just a bit over half a year.

[1]: https://epiccoleman.com/posts/2023-03-05-deforum-stable-diff...

didip
0 replies
22h11m

Stability.ai, please make sure your board is sane.

devdiary
0 replies
15h17m

A default glitch effect in the video can make the distortions a "feature not a bug"

chrononaut
0 replies
23h30m

Much like in static images, the subtle unintended imperfections are quite interesting to observe.

For example, the man in the cowboy hat seems to be almost gagging. In the train video, the tracks seem too wide while the train ice-skates across them.

aliljet
0 replies
1d2h

I've been following this space very very closely and the killer feature would be to be able to generate these full featured videos for longer than a few seconds with consistently shaped "characters" (e.g., flowers, and grass, and houses, and cars, actors, etc.). Right now, it's not clear to me that this is achieving that objective. This feels like it could be great to create short GIFs, but at what cost?

To be clear, this remains wicked, wicked, wicked exciting.

TruthWillHurt
0 replies
9h51m

And thanks to the porn community on Civit.ai!

RandomBK
0 replies
21h41m

Needs 40GB VRAM, down to 24GB by reducing the number of frames processed in parallel.

LoveMortuus
0 replies
6h12m

Once text-to-video is good enough and once text generation is good enough, we could legit actually have endless TV shows produced by individuals! We're probably still far away from that, but it is exciting to think about!

I think this will really open new ways and new doors to creativity and creative expression.

Eduard
0 replies
21h2m

I cannot join the waiting list (nor opt in to the marketing newsletter), because the sign-up form checkboxes don't toggle on Android mobile Chrome or Firefox.