Just tried to run this using their sample script on my 4090 (which has 24GB of VRAM). It ran for a little over 1 minute and crashed with an out-of-memory error. I tried both SV3D_u and SV3D_p models.

[edit]Managed to generate by tweaking the script to generate fewer frames simultaneously. 19.5GB peak VRAM usage, 1 min 25 secs to generate at 225 watts.[/edit]

The 4090 is in a weird spot. High speed but low RAM. Theoretically everything should run in ai but practically nothing runs.

Maybe don't use a gaming card for AI then? 24GB is plenty, as most games don't use more than half of that in 4K.

Maybe give me lots of money to give Nvidia for a card with more memory then?

Nvidia have held back the majority of their cards from going over 24GB for years now. It's 2024 and my laptop has 96GB of RAM available to the GPU but desktop GPUs that cost several thousands just by themselves are stuck at 24GB.

They don’t get their absurd profit margins by cannibalising their data centre chips.

This is like Intel and their refusal to support ECC memory, when AMD does on nearly all Ryzens.

Note: your laptop is probably using a 64-bit memory bus for system RAM. For GPUs, the 4090 is 384-bit. That takes up a lot more die area for the bus and memory controller.
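
Bus width matters because theoretical peak bandwidth is roughly bus width times per-pin data rate. A rough sketch of that arithmetic, assuming commonly cited per-pin rates (about 21 Gbps for the 4090's GDDR6X, 6.4 Gbps for LPDDR5); those rates and the function name are assumptions for illustration, not something stated in the thread:

```python
def peak_bandwidth_gb_s(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Theoretical peak memory bandwidth in GB/s (bits per second / 8)."""
    return bus_width_bits * gbps_per_pin / 8


# Per-pin data rates below are assumed typical values, not taken from the thread.
configs = {
    "64-bit LPDDR5 (typical laptop)": (64, 6.4),
    "384-bit GDDR6X (RTX 4090)": (384, 21.0),
    "4 x 128-bit LPDDR5": (4 * 128, 6.4),
    "8 x 128-bit LPDDR5": (8 * 128, 6.4),
}

results = {name: peak_bandwidth_gb_s(w, r) for name, (w, r) in configs.items()}
for name, gb_s in results.items():
    print(f"{name}: {gb_s:.1f} GB/s")
```

Real-world throughput is usually lower than these theoretical peaks.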

But GP's laptop with 96GB of unified memory would be an M2 Max MacBook or better. The M2 Max has a 4 x 128-bit memory bus (410GB/s) and the M2 Ultra is 8 x 128-bit (819GB/s), versus a 4090 at 1008GB/s. But see here for caveats about Mac bandwidth:

Which laptop models share system RAM with an Nvidia RTX card?

OP is probably referring to an M-series MacBook, since it has a unified memory architecture with the same memory space used by both the CPU and GPU.

Why would they do that with a gaming card? If you want more you can rent in AWS etc.

It wouldn’t be a local model if it has to work on AWS.

Isn't there the risk that if they give the gaming cards enough RAM for such tasks then they'll get bought up for that purpose and the second-hand price will go even higher?

I guess my point is, rather than give the cards more RAM, the gaming cards should just be priced cheaper.

This is unfairly downvoted. They launched the 3090 in Sep 2020 with 24GB, which was more than AMD's 16GB 6900XT launched that same month. Maybe before blaming Nvidia, blame AMD for lack of trying to compete with them? Of course they're not gonna release a gaming card with loads more VRAM because a) the competition doesn't have gaming cards with more VRAM, b) it would all be bought up for AI workloads, and c) games don't really need more, as the parent said.

Perhaps NVIDIA or somebody could invent a RAM upgrade via NVLINK? Seems plausible and not every problem would want to add another GPU when the ability to add the extra memory alone is all they need.

But why would NVIDIA do that when they can just sell you an A100 for ten times the price of a 4090?

We need AMD to compete, but from what I know their software is subpar compared to NVIDIA's offering and most of the current ML stacks are built around CUDA. Still, there's a lot of money to be made in this area now, so competition big and small should pop up.

I'd love it if AMD and Intel teamed up to make a wrapper layer for CUDA. Surely they'd both benefit greatly.

First Intel and then AMD funded a wrapper, yes. Unfortunately the new version supports AMD but no longer Intel.

That's a binary level wrapper. Of course there's also ROCm HIP at the source level, and many other things, such as SYCL

In a hypothetical near-future world, competition?

The memory is inherent to the GPU architecture. You cannot just add VRAM and expect no other bottlenecks to pop up. Yes, they can reduce the VRAM to create budget models and save a bit here and there. But adding VRAM to a top model is a tricky endeavour.

Yeah, I’m still debating whether I go with a Mac Studio with the RAM maxed out (approx $7500 for 192 GB) or a PC with a 4090. Is there a better value path with the Nvidia A series or something else? (I’m not sure about tinygrad)

You can get a previous-gen RTX A6000 with 48GB of GDDR6 for about $5000 (1). Disclosure: I run that website. Is anyone using the pro cards for inference?


I have an M1 Max with 64GB and a 3090 Ti. The M1 Max is ~4x slower at inference for the same models than the 3090 (i.e. 7t/s vs 30t/s), which depending on the task can be very annoying. As a plus you get to run really large models, albeit very slowly. Think about whether that will bother you or not. I will not give up my 3090 Ti and am rather waiting for the 5090 to see what it can do, because when programming, the Mac is too slow to shoot off questions. I use it mostly to better understand book topics now, and the 3090 Ti to do fast chat sessions.

Groq may be an option?

Just don't max out the Mac Studio and get both...

"Theoretically everything should run in ai"

Odd statement. I don't really know what you mean by that. Perhaps 'math _works_, code should too' ?

I would definitely agree that it _should_ work.

I'm of the belief that no one should _have to_ publish (e.g. to graduate, get promotions, etc) in academia, and that publications should only occur if they're believed to be near Nobel Prize worthy, and fully reproducible by code with packaging that should last and work in 10 years, from data archives that will exist in 10 years.

But it seems I have been outvoted by the administration in academia.

Hence, we get this "ai that doesn't run" phenomenon

What's the point of academia if not to publish?

Do you want to publicly fund researchers only for the industrial research partner's benefit?

It already is effectively just for industry benefit. It's been like that since the start. Work that is too expensive for industry to do (research and discovery) was put into the public sphere such that the role of industry was to take that innovation and optimize it. That's at least how it is intentionally constructed.

My main point was that there is a lot of noise in scientific journals that is caused by pressures in academia that make publishing a requirement. If these are removed, then the quality of published work increases and the quantity decreases.

There are other places to post work that is derivative and non-novel, like blogs. The field of biology has an immense amount of work that is mostly observational without strong conclusions or predictivity. A tabulation of observations should definitely be put out by a lab, and it should happen much sooner and with far less pressure than today, where the typical dance is to put the data out only at publication. The SRA is one example of a place to share data. The typical way to work could be to put all data immediately onto a public repo, sometimes comment on it in venues below scientific journals (such as blogs), and then publish only if something truly substantial comes out of it (a novel model that is analytical and highly predictive of cell behavior in all situations, for example).

It could help separate the signal from the noise. LLMs are one case where the noise is very strong, in that many papers are simply 'we fine tuned an llm'.

So how should knowledge be shared in academia without publishing? Any work worthy of a Nobel Prize (or more likely, a Turing Award) is built on top of significant amounts of other research that itself wasn't so groundbreaking.

That said, I certainly think that researchers can do more to make their code and data more accessible. We have the tools to do so already but the incentives are often misaligned.

Almost sounds like a GPU vendor who isn't seeing enough competition.

Or, you know, the fact that the card is made for playing video games, not training AI models.

Almost like the only competition of Nvidia is the niece of the CEO.

Didn't know 24GB was considered low lol.

For AI that's either a very fat SDXL model at its max native resolution, or a quantized 34B parameter model, so it's on the low side. Compare that with the Blackwell AI "superchip" announced yesterday that appears to the programmer as a single GPU with 30TB of RAM.

They don't want to cannibalize sales of the super-expensive GPUs dedicated to ML/AI.

5090 likely won't have more than 32 GB, if even that much.

Even 32GB would be great for a gaming card; any more and you're never seeing it on sale, as it will be bought by the truckload for AI, so of course they're not gonna balloon the VRAM. I suspect we'd still be at 16GB, but they launched the 3090 in Sep 2020 with 24GB, before all this craze really, and lowering it is bad optics now.

I made a Manifold market[0] on the amount of ram a 5090 will have, and while pretty much nobody has participated, I just checked and the market is amusingly at the 32GB you've also quoted. Just like you, I hope it will be more but I fear it will be even less.


You can add multiple, but practically speaking you're better off with used 3090s, of which you can get two for the price of one 4090.

I have 3090 Ti and I can run Q4 quant 33b models at 30t/s with 8k context. A 4090 would allow me to do the same but with ~45t/s, both inference speeds are more than fast enough for people so 3090 is the usual choice. In my tests on runpod, H100 with 80GB memory is around the same speed as 3090, so slower than a 4090.

Don't forget the 24GB P40, which is a third the speed but also a third the cost of a 3090 (both used).

What can't you run? Unquantised large text models are the only thing I can't run

Stable diffusion, stable video, text models, audio models, I never had issues with anything yet

The 4090 is in a bit of a funny space for LLMs.

There's a lot of open weights activity around 7B/13B models which the 4090 will run with ease. But you can run those OK on much cheaper cards like the 4070Ti (which is of course why they're popular).

And there's a lot of open weights activity around 70B and 8x7B models which are state-of-the-art - but too big to fit on a 4090. There's not much activity around 30B models, which are too big to be mainstream and too small to be cutting edge.

If you're specifically looking to QLoRA fine-tune a 7B/13B model a 4090 can do that - but if you want to go bigger than that you'll end up using a cloud multi-gpu machine anyway.
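
A back-of-the-envelope way to see why 7B/13B models fit and 70B models do not: the weights alone take roughly parameter count times bits per weight, divided by 8. This is only a sketch under simplifying assumptions. It ignores the KV cache, activations, and framework overhead, and real 4-bit quantization formats use somewhat more than 4 bits per weight. The function name is illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


VRAM_GB = 24  # RTX 4090

estimates = {}
for name, params, bits in [
    ("7B fp16", 7, 16),
    ("13B 8-bit", 13, 8),
    ("34B 4-bit", 34, 4),
    ("70B 4-bit", 70, 4),
]:
    need = weight_memory_gb(params, bits)
    estimates[name] = need
    verdict = "fits" if need < VRAM_GB else "does not fit"
    print(f"{name}: ~{need:.1f} GB of weights, {verdict} in {VRAM_GB} GB")
```

The headroom left after the weights is what the context window and activations have to live in, which is why a model that nominally fits can still run out of memory.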

4090 has more VRAM than most computers have system RAM. Surprised this is considered "low RAM" in any way except for relative to datacenter cards and top-spec ASi.

You're comparing RAM amounts to other RAM amounts without considering requirements. 24GB is more than (most) current games would ever require, but is considered an uncomfortably constrictive minimum for most industrial work.

Traditional CPU-bound physics/simulation models have typically wanted all the RAM they could get; the more RAM the more accurate the model. The same is true for AI models.

I can max out 24GB just using spreadsheets and databases, let alone my 3D work or anything computational.

It is a card targeted at gamers that professionals are buying. They should be buying the A6000, which has 48GB.

Yeah, this is to be expected with early adoption. This stuff comes out of the lab and it's not perfect. The key thing to evaluate is the trajectory and pace of development. Much of what folks challenged ChatGPT with a year ago is long lost in the dust. Go look at stable diffusion this time last year. Dall-E couldn't do words and hands, it nails that 90% of the time in my experience today.

About words, DALL-E is not even close to nailing it 90% of the time. Not even 50%. Maybe they nerf it when you request a logo from it, but that was my experience in the last few days.

I managed to get it working with a 4090. You need to adjust the parameter decoding_t of the sample function in to a lower value (decoding_t = 5 works fine for me). I also needed to install imageio==2.19.3 and imageio-ffmpeg
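
To illustrate why lowering a parameter like decoding_t helps: assuming it controls how many frames are decoded at once (which matches the "fewer frames simultaneously" fix in the first comment), decoding in smaller chunks trades a little speed for a lower peak memory footprint. The sketch below is not the code from the sample script; decode_fn and the toy decoder are hypothetical stand-ins.

```python
from typing import Callable, List, Sequence, TypeVar

Latent = TypeVar("Latent")
Frame = TypeVar("Frame")


def decode_in_chunks(
    latents: Sequence[Latent],
    decode_fn: Callable[[Sequence[Latent]], List[Frame]],
    decoding_t: int,
) -> List[Frame]:
    """Decode latents `decoding_t` frames at a time to cap peak memory."""
    if decoding_t < 1:
        raise ValueError("decoding_t must be at least 1")
    frames: List[Frame] = []
    for start in range(0, len(latents), decoding_t):
        chunk = latents[start:start + decoding_t]
        frames.extend(decode_fn(chunk))  # only this chunk is resident at once
    return frames


# Toy usage with a hypothetical decoder that records the batch sizes it saw.
batch_sizes: List[int] = []


def fake_decode(chunk):
    batch_sizes.append(len(chunk))
    return [f"frame-{x}" for x in chunk]


# 21 latents is an arbitrary count chosen for the demo.
frames = decode_in_chunks(list(range(21)), fake_decode, decoding_t=5)
print(len(frames), batch_sizes)
```

The output frames are the same regardless of chunk size; only the largest batch the decoder ever sees changes.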

Ah, yep! You're right! It works now!

Dunno why the defaults for this stuff aren't tuned for baseline hardware. I feel like I always have to tweak the batch size down on all the base scripts even with 24GB, because everything assumes 48GB.
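
One way scripts could avoid hard-coding a batch size is to back off automatically when memory runs out. A minimal sketch, assuming the workload can be wrapped in a function of the batch size; run_with_backoff and the toy workload are hypothetical names. With PyTorch you would pass torch.cuda.OutOfMemoryError (available in recent versions) as oom_error and likely free cached memory between attempts.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_backoff(
    run: Callable[[int], T],
    start_batch: int,
    oom_error: Type[BaseException] = MemoryError,
) -> Tuple[T, int]:
    """Call run(batch), halving the batch size on out-of-memory errors."""
    batch = start_batch
    while batch >= 1:
        try:
            return run(batch), batch
        except oom_error:
            batch //= 2  # retry with half the batch
    raise RuntimeError("out of memory even at batch size 1")


# Toy workload: pretends anything above 5 items at once exhausts memory.
def toy_run(batch: int) -> str:
    if batch > 5:
        raise MemoryError
    return f"ok-{batch}"


result, used_batch = run_with_backoff(toy_run, start_batch=14)
print(result, used_batch)
```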

If the animations shown are representative, then the mesh output may very well be good enough to use in a 3d printer.

Looking forward to experimenting with this.

With previous attempts at this problem the shaded examples could be quite misleading because details that appeared to be geometric were actually just painted over the surface as part of the texture, so when you took that texture away you just had a melted looking blob with nowhere near as much detail as you thought. I'd reserve judgement until we see some unshaded meshes.

What they show in the demo:

What comes out of the 3D printer:

It’s always been this. None of these ever show the untextured model.

When I see a demo where they are showing wireframes I know it’ll be good enough.

Seems like a tougher nut to crack than image generation was, since there isn't a bajillion high quality 3D models lying around on the internet to use as training data, everyone is trying to do 3D model generation as a second-order system using images as the training data again. The things that make 3D assets good, the tiny geometric details that are hard to infer without many input views of the same object, the quality of the mesh topology and UV mapping, rigging and skinning for animation, reducing materials down to PBR channels that can be fed into a renderer and so on aren't represented in the input training data, so the model is expected to make far more logical leaps than image generators do.

It almost seems easier, in that you have an arbitrary # of real world objects to scan and the hardware is heavily commoditized (IIRC iPhones have this built in at highres now?)

How is building a dataset easier than using a prebuilt dataset?

In context, the conversation was beyond a dichotomy - thankfully. Having only 2 choices leaves conversation at people insisting one is better, and becomes an argument about definitions where people take turns alternating being "right" from the viewpoint of a neutral observer.

It's proposing a solution to the author's observation that everyone is doing it in second order fashion and missing a significant amount of necessary data.

The implication is that rather than doing it the hard way via the already-obtained 2nd order dataset, it'll be easier to get a new dataset, and getting that dataset will be significantly easier than it was to get the second-order dataset, as you don't need to worry about aesthetic variety as much as teaching what level of detail is needed in the mesh for it to be "real".

I know where I could get several hundred terabytes (maybe an exabyte? It’s constantly growing) of ultra high quality STL files designed for 3D printing. I just don’t have the storage or the knowledge of how to turn those into a model that outputs new STL files.

I’d imagine it’d require a ton of tagging, although I have a good idea of how I could leverage existing APIs to tag it mostly automatically by generating three still image thumbnails of the content, then feeding that through CLIP, and verifying that all two or three agree on what it’s an STL of, and manually tag the ones that fail that test.
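
The agreement check described above can be sketched as a simple vote across the thumbnail views. Everything here is an assumption for illustration: classify stands in for whatever wraps a CLIP-style zero-shot classifier, and the thumbnails are placeholders rather than real renders.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def consensus_tag(
    thumbnails: Sequence[object],
    classify: Callable[[object], str],
    min_agree: int = 2,
) -> Optional[str]:
    """Return the label at least `min_agree` views agree on, else None."""
    if not thumbnails:
        return None
    votes = Counter(classify(t) for t in thumbnails)
    label, count = votes.most_common(1)[0]
    return label if count >= min_agree else None


# Toy usage: a lookup table plays the role of the (hypothetical) classifier.
fake_labels = {"front.png": "teapot", "side.png": "teapot", "top.png": "bowl"}
tag = consensus_tag(list(fake_labels), fake_labels.get, min_agree=2)
strict = consensus_tag(list(fake_labels), fake_labels.get, min_agree=3)
print(tag, strict)
```

Files that come back as None would go to the manual tagging queue.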

There’s a pretty big difference between hundreds of terabytes and an exabyte. Maybe you meant petabyte?

since there isn't a bajillion high quality 3D models lying around on the internet to use as training data

There aren't a bajillion high-quality 3D models of everything, but there are an unbounded number of high-quality 3D models of some things, due to the existence of procedural mesh systems for things like foliage.

Although I wonder if having a few very-well-understood object types like these, to serve as a base, would be enough to allow such a model to deduce more generalized rules of optics, such that it could then be trained on other object categories with much smaller training sets...

Couldn’t a deep network learn the latent 3D representation just on video input?

(I dream of the day when this can be used to automatically create paper-craft templates.)

There exists software to reproject texture normals back on to a high poly model. So this problem does have a solution for anyone interested.

I may be speaking out of ignorance here, but couldn't you use photogrammetry techniques to translate these to a higher resolution mesh?

Only if you have multiple images of the same areas so that you can extract actual position. And there is no guarantee that multiple pictures of the same model have the same detail, much less in a manner that can be triangulated with accuracy. A lot of the photogrammetry algorithms discard points that don't match certain error-bars.

So yes, there might be a wooden frame in the middle of that window, but does it match the math on both angles of it? Doubt it.

You can generate pretty reliable texture depth maps from just an image. It’s going to be trash if you’re trying to generate the depth for the entire 3D model, but I presume it’s going to do a good job with just texture. Then you just use a displacement based on the depth map.
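
The displacement step amounts to pushing each vertex along its normal by the depth sampled at that vertex. A minimal sketch, assuming unit-length normals and depth values already sampled per vertex from the depth map (the UV lookup is left out); the names are illustrative.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def displace(
    vertices: Sequence[Vec3],
    normals: Sequence[Vec3],
    depths: Sequence[float],
    scale: float = 1.0,
) -> List[Vec3]:
    """Move each vertex along its unit normal by depth * scale."""
    out: List[Vec3] = []
    for (x, y, z), (nx, ny, nz), d in zip(vertices, normals, depths):
        offset = d * scale
        out.append((x + nx * offset, y + ny * offset, z + nz * offset))
    return out


# Toy usage: a flat quad facing +Z, displaced by made-up depth samples.
quad = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
up = [(0.0, 0.0, 1.0)] * 4
displaced = displace(quad, up, depths=[0.0, 0.5, 1.0, 0.25], scale=2.0)
print(displaced)
```

This only adds detail where the mesh has enough vertices, so in practice the surface is subdivided first.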

Therefore, what is the main usecase of this model? Generating cheap 3D assets for videogames?

I don't think they have a specific use-case for this model, they're throwing ideas at the wall again in the hopes some of them stick and eventually turn into another product. The paper doesn't discuss any of the problems that would need to be solved in order to easily generate game-ready assets so I think it's safe to assume that it currently doesn't.

For games at the very least you need to consider polygon budget, getting reasonably good UVs, and generating materials which fit into a PBR shader pipeline, at least if it's going to work with rendering pipelines as we know them today (as opposed to rendering neural representations directly, which is a thing people are trying to do but is totally unproven in production).

I'd be willing to bet you could create a diffusion model to map unrefined meshes to UV-fixed and remeshed surfaces. If you had a large enough library of good meshes you just programmatically mess 'em up and use that as the dataset.
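
The "programmatically mess 'em up" idea is the usual recipe for building paired training data: corrupt a clean example and keep the original as the target. A toy sketch that only jitters vertex positions; a real pipeline would also need to damage topology and UVs, which this does not attempt. All names are illustrative.

```python
import random
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mesh = Sequence[Vec3]  # vertices only, for simplicity


def degrade_vertices(vertices: Mesh, noise: float, rng: random.Random) -> List[Vec3]:
    """Jitter each coordinate by uniform noise in [-noise, noise]."""
    return [
        (x + rng.uniform(-noise, noise),
         y + rng.uniform(-noise, noise),
         z + rng.uniform(-noise, noise))
        for x, y, z in vertices
    ]


def make_pairs(clean_meshes: Sequence[Mesh], noise: float = 0.05, seed: int = 0):
    """Return (degraded, clean) pairs suitable as (input, target) examples."""
    rng = random.Random(seed)
    return [(degrade_vertices(m, noise, rng), list(m)) for m in clean_meshes]


# Toy usage with a single triangle as the "library of good meshes".
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pairs = make_pairs([triangle], noise=0.05)
print(pairs[0][0])
```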

There are AI models that can create proper meshes though.

Which ones?

I don't know much about 3D printing, would be very interested in learning more about this idea if you'd be so kind as to expand on it. Could I have AI spend all day auto scanning what teens are doing on instagram, auto generate toys based on it, auto generate advertisements for the toys, auto 3D print on demand?

OP is suggesting that this (AI model? I honestly am behind on the terminology) could replace one of the common steps of 3D printing - specifically, the step where you create a digital representation of the physical object you would want to end up with.

There are other steps to 3D printing in general, though; a super rough outline:

- Model generation

- "Slicing" - processing the 3D model into instructions that the 3D printer can handle, as well as adding any support structures or other modifications to make it printable

- Printing - the actual printing process

- Post-processing - depending on the 3D printing technology used, the desired resulting product, and the specific model/slicing settings, this can be as simple as "remove from bed and use" to "carefully snip off support structures, let cure in a UV chamber for X minutes, sand and fill, then paint"

As I said before, this AI model specifically would cover 3D model generation. If you were to use a printing technology that doesn't require support structures, and handles color directly in the printing process (I think powder bed fusion is the only real option here?), the entire process should be fairly automatable - a human might be needed to remove the part from the printer, but there might not be much post-processing to do.

The rest of your desired workflow is a bit more nebulous - I don't know how you would handle "scanning what teens are doing on instagram", at least in a way that would let you generate toys from the information; generating and posting the advertisement shouldn't be too hard - have a standardish template that you fill in with a render from the model, and the description; printing on demand again is possible, though you'll likely need a human to remove the part, check it for quality and ship it. You could automate the latter, but that would probably be more trouble than it's worth.

Interesting, to be clear I don't think this is a good idea and it's kinda my nightmare post capitalism hell. I just think it's interesting this could be done now.

On finding out what teens want, that part is somewhat easy-ish, I guess you'd need a couple of agents, one that is scanning teen blogs for stories and then converting them to key words, then another agent that takes the key words (#taylorswift #HaileyBieberChiaPudding #latestkdrama etc) into Instagram, after a while your recommend page will turn into a pretty accurate representation of what teens are into, then just have an agent look at those images and generate difs of them. I doubt it would work for a bunch of reasons, but it's an interesting thought experiment! Thanks!

Hypothetically, sure, assuming the parent comment that these meshes are sufficient for modelling is correct and that you can find any teens who want a non-digital toy.

I think a good hobbyist application for this would be something like modelling figurines for games, which is already a pretty popular 3D printing application. This would allow people with limited modelling skills to bring fantastical, unique characters to life “easily”.

Pretty much. We're already generating images of monsters and characters for a D&D campaign; being able to print those in 3D would be pretty amazing.

I think their suggestion was more "I have a photo of a cool horse, and now I would like a 3D model of that same horse."

Another way of looking at it, 3D artists often begin projects by taking reference images of their subject from multiple angles, then very manually turning that into a 3D model. That step could potentially be greatly sped up with an algorithm like this one. The artist could (hopefully) then focus on cleanup, rigging, etc, and have a quality asset in significantly less time.

The question is whether this actually "creates a 3d model based on the picture", or if it "finds an existing model that looks similar to the picture and texture map it".

I'm sorry for the dumb, lazy question. But would the input require more than one image? Is there a demo URL to test this? I think it might just be time to buy a 3D printer.

EDIT> Does "single image inputs" mean more than one image?

Single image means one image.

lol cmon guys don't be too hard on me it does say "inputs"

I do see how "single image inputs" can be conflated with "multiple inputs of a single image each time", as opposed to "video".

TBH I always look at the worst case scenario. I was worried it meant it needed 3 images inputted as a single image at different steps of the process, so requiring different angles. I wasn't sure, but thought it best to check. I feel like it would have been clearer to have said something like "generates a 3D model from a single image" (not exact wording, but you catch my drift). Sorry I am over-analysing, but all feedback is good, right?

Describe in single words only the good things that come into your mind about... your mother.

Can confirm the word single means 1

It's just a single image. It guesses the shape of the bits it can't see based on vast amounts of training data.

Amazing! Thank you

I have an even lazier question after failing to speed-read the article.

Does this output an actual 3D mesh? Or does it only output a 3d-looking rendered animation?

Does anyone know what hardware inference can run on or memory requirements?

It crashes with an out-of-memory error on my 24GB 4090, so at least when it comes to their sample script the answer is "a lot". Maybe it's just an inefficient implementation though.

Pretty much every initial Stability release has been inefficient, and resource requirements have dropped a lot once community engines optimized for real consumer hardware appeared for running the model.

OTOH, with their shift to a less open licensing structure, community tooling probably won’t emerge with the same level of energy.

In the repo the model weights file is 9.37GB, whereas sdxl turbo is 13.9GB, and I don't see any mention of huge context windows, so probably it just needs a decent graphics card.

that demo animation is so clever and satisfying

But it doesn't look very realistic, tbh.

it doesn't break Euclidean space at least

I can’t get them to play

I wonder when Emad will be outed as a fed or a fraud. He's certainly leaving a trail of nasty behavior in the industry.

Stable Video 3D (SV3D) is a generative model based on Stable Video Diffusion that takes in a still image of an object as a conditioning frame, and generates an orbital video of that object.

So can it actually output a 3d model? Or just images of what it thinks the object would look like from other angles?

The reference video says they use a NeRF / structure from motion and then create a mesh with marching cubes from the generated radiance field. This is how most SOTA text-to-object generators work now as well.
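
For intuition, an "orbital video" amounts to viewing the object from cameras spaced around a circle, and having views at known angles is what makes a NeRF-style reconstruction feasible afterwards. A small sketch of such camera positions; the frame count, radius, and elevation are arbitrary assumptions, not SV3D's actual settings.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def orbit_positions(
    n_frames: int, radius: float = 2.0, elevation_deg: float = 10.0
) -> List[Vec3]:
    """Camera positions evenly spaced on a circular orbit around the origin."""
    elev = math.radians(elevation_deg)
    positions: List[Vec3] = []
    for i in range(n_frames):
        azim = 2 * math.pi * i / n_frames  # evenly spaced azimuth angles
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)  # constant height for a fixed elevation
        positions.append((x, y, z))
    return positions


cams = orbit_positions(n_frames=8)
for cam in cams:
    print(tuple(round(c, 3) for c in cam))
```

Each camera would look toward the origin, where the object sits.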

I'm also struggling to find any examples of how to actually get a 3D model output. Very few references to this capability outside of the blog post.

I'd like to play around with something like this, but from my understanding my machine (MacBook, 2021 M1) isn't nearly powerful enough (right?). Are there remote/cloud environments where I can run models like this?

I suggest just using Stability's API. You aren't allowed to use it locally for commercial use anyway.

You could set something up on RunPod or AWS, but I doubt it's worth the effort.

Awesome, thank you!

It does look like SV3D is not a part of the API currently, but only a matter of time I imagine.

Anyone know of anything that'll auto rig/add weights?

There are numerous tools that auto-rig humanoid figures. The obvious one:

The emphasis here is Single Image, but can this model generate with multiple images too?

We know that a single image of an object physically can't cover all the sides of it, so it's all guesswork by the AI. This is totally fine for certain scenarios, but in lots of other cases it's trivial to have multiple images of the same object, and if that offers higher fidelity, it's totally worth it.

I'm aware there are many algorithms or AI models that already do that. I'm asking about Stability's one specifically because if they have impressive single-image results, surely their multi-image results would also be much better than state-of-the-art?

If it's not there yet, I'm willing to bet it will be soon enough given folks hacking it apart and injecting their own solutions.

All the examples resemble plastic children's toys...

How would it handle other objects? (People, fabrics, buildings, plants, mountains, mechanical parts, etc)

It's hard to get camera position tracking for random objects, so it looks like they used simulations. There's probably a lot more plastic children's toy models in Blender than people, fabrics, buildings, &c.

I can't wait until we can use something like this for architectural design

SDXL+Controlnet and then feeding it just blocked out depth maps are probably more useful for that.

They compare against Zero123-XL, but they should compare against MVDream instead. MVDream is quite good. If you fiddle with the loss you can get even better results.

Did you write the blog post using AI?