
Stable Fast 3D: Rapid 3D Asset Generation from Single Images

timr
36 replies
1d2h

For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI. It hits the sweet spot for the fundamental limitations of the methods:

* so-called "hallucination" (actually just how generative models work) is a feature, not a bug.

* anyone can easily see the unrealistic and biased outputs without complex statistical tests.

* human intuition is useful for evaluation, and not fundamentally misleading (i.e. the equivalent of "this text sounds fluent, so the generator must be intelligent!" hype doesn't really exist for imagery. We're capable of treating it as technology and evaluating it fairly, because there's no equivalent human capability.)

* even lossy, noisy, collapsed and over-trained methods can be valuable for different creative pursuits.

* perfection is not required. You can easily see distorted features in output, and iteratively try to improve them.

* consistency is not required (though it will unlock hugely valuable applications, like video, should it ever arrive).

* technologies like LoRA allow even unskilled users to train character-, style- or concept-specific models with ease.
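
For anyone curious what that looks like in practice, here's a rough sketch of applying an already-trained LoRA with the diffusers library (the LoRA directory, filename, and trigger word below are made-up placeholders):

    import torch
    from diffusers import StableDiffusionXLPipeline

    # Load a base SDXL checkpoint in fp16 so it fits on a consumer GPU.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # Apply a character/style LoRA on top (placeholder directory and filename).
    pipe.load_lora_weights("./loras", weight_name="my_character_lora.safetensors")

    # The trigger word "mychar4cter" is whatever token the LoRA was trained on.
    image = pipe("photo of mychar4cter riding a bicycle, watercolor style",
                 num_inference_steps=30).images[0]
    image.save("lora_test.png")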

I've been amazed at how much better image / visual generation models have become in the last year, and IMO, the pace of improvement has not been slowing as much as it has for text models. Moreover, it's becoming increasingly clear that the future isn't the wholesale replacement of photographers, cinematographers, etc., but rather a generation of crazy AI-based power tools that can do things like add and remove concepts in imagery with a few text prompts. It's insanely useful, and just like Photoshop in the 90s, a new generation of power users is already emerging and doing wild things with the tools.

leetharris
9 replies
1d2h

For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI. It hits the sweet spot for the fundamental limitations of the methods:

I am biased (I work at Rev.com and Rev.ai), but I totally agree and would add one more thing: transcription. Accurate human transcription takes a really, really long time to do right. Often a ratio of 3:1-10:1 of transcriptionist time to original audio length.

Though ASR is only ~90-95% accurate on a lot of "average" audio, it is often 100% accurate on high-quality audio.

It's not only a cost savings thing, but there are entire industries that are popping up around AI transcription that just weren't possible before with human speed and scale.
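
For a sense of how low the barrier is now, here's a rough sketch of local open-source ASR with the openai-whisper package (not Rev's API; the filename is a placeholder, and larger models trade speed for accuracy):

    import whisper

    # "base" is fast; "medium" / "large" are slower but noticeably more accurate.
    model = whisper.load_model("base")

    # Transcribe a local recording (placeholder filename).
    result = model.transcribe("meeting_recording.mp3")
    print(result["text"])

    # Segment-level timestamps come along for free.
    for seg in result["segments"]:
        print(f"[{seg['start']:.1f}s to {seg['end']:.1f}s] {seg['text']}")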

toddmorey
3 replies
1d1h

Also the other way around: text to speech. We're at the point where I can finally listen to computer generated voice for extended periods of time without fatigue.

There was a project mentioned here on HN where someone was creating audio book versions of content in the public domain that would never have been converted through the time and expense of human narrators because it wouldn't be economically feasible. That's a huge win for accessibility. Screen readers are also about to get dramatically better.

toddmorey
0 replies
1d1h

That's the one! Thanks!

fnordpiglet
0 replies
2h58m

I'd add image to text, which I use all the time. For instance, I'll take a photo of a board or device and ChatGPT/Claude/pick your frontier multimodal model is almost always able to classify it accurately and describe details, including chipsets, pinouts, etc.
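
Scripting the same thing is straightforward too; here's a rough sketch with the OpenAI Python SDK (the photo filename is a placeholder, and it assumes OPENAI_API_KEY is set in the environment):

    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Encode a local photo of the board (placeholder filename).
    b64 = base64.b64encode(open("board_photo.jpg", "rb").read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What board is this? List the main chips and any visible pinout."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)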

llm_trw
2 replies
1d1h

Are there any models that can do diarization well yet?

I need one for a product and the state of the art, e.g. pyannote, is so bad it's better to not use them.
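
(For anyone who wants to judge the baseline themselves, running pyannote looks roughly like this; the model is gated behind a Hugging Face token, and the audio filename is a placeholder:)

    from pyannote.audio import Pipeline

    # Gated model: accept the license on Hugging Face and supply an access token.
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token="hf_your_token_here")

    diarization = pipeline("call_recording.wav")

    # Who spoke when.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s to {turn.end:.1f}s: {speaker}")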

throw03172019
1 replies
1d

Deepgram has been pretty good for our product. Fast and fairly accurate for English.

llm_trw
0 replies
19h54m

Do they have a local model?

I keep getting burned by APIs having stupid restrictions that make use cases impossible which would be trivial if you could run the thing locally.

timr
0 replies
1d2h

I agree. I think it's more of a niche use-case than image models (and fundamentally harder to evaluate), but transcription and summarization is my current front-runner for winning use-case of LLMs.

That said, "hallucination" is more of a fundamental problem for this area than it is for imagery, which is why I still think imagery is the most interesting category.

letomotolo
0 replies
23h33m

German public television already switched to automatic transcription a few years back.

mitthrowaway2
6 replies
1d2h

it's becoming increasingly clear that the future isn't the wholesale replacement of photographers, cinematographers, etc.

I'd refrain from making any such statements about the future;* the pace of change makes it hard to see the horizon beyond a few years, especially relative to the span of a career. It's already wholesale-replacing many digital artists and editorial illustrators, and while it's still early, there's a clear push starting in the cinematography direction. (I fully agree with the rest of your comment, and it's strange how much diffusion models seem to be overlooked relative to LLMs when people think about AI progress these days.)

* (edit: about the future impact of AI on jobs).

timr
5 replies
1d2h

I mean, my whole comment is a prediction of the future, so that's water under the bridge. Maybe you're right and this is the start of the apocalypse for digital artists, but it feels more like photoshop in 1990 to me -- and people were saying the same stuff back then.

It's already wholesale-replacing many digital artists and editorial illustrators

I think you're going to need to cite some data on a claim like that. Maybe it's replacing the fiverr end of the market? It's certainly much harder to justify paying someone to generate a (bad) logo or graphic when a diffusion model can do the same thing, but there's no way that a model, today, can replace a skilled artist. Or said differently: a skilled artist, combined with a good AI model, is vastly more productive than an unskilled artist with the same model.

cjbgkagh
2 replies
1d1h

What happens when the AI takes the low end of the market is that the people who catered to the low end now have to try to compete more in the mid-to-high end. The mid end facing increased competition has to try to move up to the high end. So while AI may not be able to compete directly with the high end it will erode the negotiating power and thus the earning potential of the high end.

sroussey
1 replies
1d1h

We have watched this same process repeat a few times over the last century with photography.

timr
0 replies
20h29m

Or graphic design, or video editing, or audio mastering, or...every new tool has come with a bunch of people saying things like "what will happen to the linotype operators!?"

I sort of hate this line of argument, but it also has been manifestly true of the past, and rhymes with the present.

wiz21c
1 replies
11h3m

Pay 10 unskilled artists to do a bad job and we will complain about 10 bad logos. Now, for a fraction of the price, generate 10,000 low-quality AI logos and flood the market with them. Market expectations will drop, and suddenly your AI will be on par with the artists...

(in case you think the market will not behave like that, just have a look at how we produce low quality food and how many people are perfectly fine with that)...

lomase
0 replies
10h22m

Today an engineer does the job of 100 thanks to computers.

llm_trw
4 replies
1d2h

For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI.

Let me show you the future: https://www.youtube.com/watch?v=eVlXZKGuaiE

This is an LLM controlling an embodied VR body in a physics simulation.

It is responding to human voice input not only with voice but body movements.

Transformers aren't just chatbots, they are general symbolic manipulation machines. Anything that can be expressed as a series of symbols is a thing they can do.

latentsea
3 replies
20h24m

This is an LLM controlling an embodied VR body in a physics simulation.

No it's not. It's VAM that controls the character; it's literally just using a bog-standard LLM as a chatbot, feeding the text into a VAM plugin, and VAM itself does the animation. Don't get me wrong, it's absolutely next-level to experience chatbots this way, but it's still a chatbot.

llm_trw
2 replies
20h11m

The animation, not the movement decisions.

This is as naive as calling an industrial robot 'just a calculator'.

latentsea
1 replies
14h39m

The movement decisions are also just text from the LLM and are heavily coupled with what's available in the scene. It's not some free autonomous agent, nor were the movement decisions trained on any special type of tokens other than plain text.

llm_trw
0 replies
7h58m

Yes and?

derefr
4 replies
1d2h

I would argue the opposite — image generation is the clear loser. If you've ever tried to do it yourself, grabbing a bunch of LoRAs from Civitai to try to convince a model to draw something it doesn't initially know how to draw — it becomes clear that there's far too much unavoidable correlation between "form" and "representation" / "style" going on in even a SOTA diffusion model's hidden layers.

Unlike LLMs, that really seem to translate the text into "concepts" at a certain embedding layer, the (current, 2D) diffusion models will store (and thus require to be trained on) a completely different idea of a thing, if it's viewed from a slightly different angle, or is a different size. Diffusion models can interpolate but not extrapolate — they can't see a prompt that says "lion goat dragon monster" and come up with the ancient-greek Chimera, unless they've actually been trained on a Chimera. You can tell them "asian man, blond hair" — and if their training dataset contains asian men and men with blonde hair but never at the same time, then they won't be able to "hallucinate" a blond asian man for you, because that won't be an established point in the model's latent space.

---

On a tangent: IMHO the true breakthrough would be a model for "text to textured-3D-mesh" — where it builds the model out of parts that it shapes individually and assembles in 3D space not out of tris, but by writing/manipulating tokens representing shader code (i.e. it creates "procedural art"); and then it consistency-checks itself at each step not just against a textual embedding, but also against an arbitrary (i.e. controlled for each layer at runtime by data) set of 2D projections that can be decoded out to textual embeddings.

(I imagine that such a model would need some internal "blackboard" of representational memory that it can set up arbitrarily-complex "lenses" for between each layer — i.e. a camera with an arbitrary projection matrix, through which is read/written a memory matrix. This would allow the model to arbitrarily re-project its internal working visual "conception" of the model between each step, in a way controllable by the output of each step. Just like a human would rotate and zoom a 3D model while working on it[1]. But (presumably) with all the edits needing a particular perspective done in parallel on the first layer where that perspective is locked in.)

Until we have something like that, though, all we're really getting from current {text,image}-to-{image,video} models is the parallel layered inpainting of a decently, but not remarkably exhaustive pre-styled patch library, with each patch of each layer being applied with an arbitrary Photoshop-like "layer effect" (convolution kernel.) Which is the big reason that artists get mad at AI for "stealing their work" — but also why the results just aren't very flexible. Don't have a patch of a person's ear with a big earlobe seen in profile? No big-earlobe ear in profile for you. It either becomes a small-earlobe ear or the whole image becomes not-in-profile. (Which is an improvement from earlier models, where just the ear became not-in-profile.)

[1] Or just like our minds are known to rotate and zoom objects in our "spatial memory" to snap them into our mental visual schemas!

mrandish
2 replies
23h0m

Until we have something like that...

The kind of granular, human-assisted interaction interface and workflow you're describing is, IMHO, the high-value path for the evolution of AI creative tools for non-text applications such as imaging, video and music, etc. Using a single or handful of images or clips as a starting place is good but as a semi-talented, life-long aspirational creative, current AI generation isn't that practically useful to me without the ability to interactively guide the AI toward what I want in more granular ways.

Ideally, I'd like an interaction model akin to real-time collaboration. Due to my semi-talent, I've often done initial concepts myself and then worked with more technically proficient artists, modelers, musicians and sound designers to achieve my desired end result. By far the most valuable such collaborations weren't necessarily with the most technically proficient implementers, but rather those who had the most evolved real-time collaboration skills. The 'soft skill' of interpreting my directional inputs and then interactively refining or extrapolating them into new options or creative combinations proved simply invaluable.

For example, with graphic artists I've developed a strong preference for working with those able to start out by collaboratively sketching rough ideas on paper in real-time before moving to digital implementation. The interaction and rapid iteration of tossing evolving ideas back and forth tended to yield vastly superior creative results. While I don't expect AI-assisted creative tools to reach anywhere near the same interaction fluidity as a collaboratively-gifted human anytime soon, even minor steps in this direction will make such tools far more useful for concepting and creative exploration.

derefr
1 replies
21h9m

...but I wasn't describing a "human-assisted interaction interface and workflow." I was describing a different way for an AI to do things "inside its head" in a feed-forward span-of-a-few-seconds inference pass.

mrandish
0 replies
16h40m

Thanks for the correction. Not being well-versed in AI tech, I misinterpreted what you wrote and assumed it might enable more granular feedback and iteration.

earthnail
0 replies
1d1h

I think you’re arguing about slightly different things. OP said that image generation is useful despite all its shortcomings, and that the shortcomings are easy to deal with for humans. OP didn’t argue that the image generation AIs are actually smart. Just that they are useful tech for a variety of use cases.

ibash
2 replies
1d2h

anyone can easily see the unrealistic outputs without complex statistical tests.

This is key, we’re all pre-wired with fast correctness tests.

Are there other data types that match this?

sounds
0 replies
1d1h

Software (I mean the product, not the code)

Mundane tasks that can be visually inspected at the end (cleaning, organizing, maintenance and mechanical work)

batch12
0 replies
1d2h

Audio to a lesser degree

thrance
1 replies
1d1h

Honestly, I have yet to see an AI-generated image that makes me go "oh wow". It's missing those last 10 percent that always seem to elude neural networks.

Also, the very bad press gen AI gets is very much slowing down adoption. Particularly among the creative-minded people, who would be the most likely users.

jokethrowaway
0 replies
1d

Hop on civitai

There's plenty of mindblowing images

letomotolo
0 replies
23h32m

LLMs are a breakthrough for the human-to-computer interface.

The knowledge answering is secondary, in my opinion.

kkukshtel
0 replies
1d2h

This general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI

I think it's easy to totally miss that LLMs are just being completely and quietly subsumed into a ton of products. They have been far more successful, and many image generation models use LLMs on the backend to generate "better" prompts for the models themselves. LLMs are the bedrock.

kiwi_kim
0 replies
19h6m

I agree, but I'm a bit biased: our start-up www.sticky.study is in this space.

What we've seen over the last year, trying out dozens of models and AI workflows, is that the fit between 1) a model's error tolerance and 2) its working context is super important.

AI hallucinations break a lot of otherwise useful implementations. It's just not trustworthy enough. Even with AI imagery, some use cases require precision - AI photoshoots and brand advertising come to mind.

The sweet spot seems to be as part of a pipeline where the user only needs a 90% quality output. Or you have a human + computer workflow - a type of "Centaur" - similar to Moravec's Paradox.

CuriouslyC
0 replies
1d2h

Image models are a great way to understand generative AI. It's like surveying a battlefield from the air as opposed to the ground.

calini
10 replies
1d2h

I'm going to 3D print so much dumb stuff with this.

jsheard
8 replies
1d2h

They're still hesitant to show the untextured version of the models so I would assume it's like previous efforts where most of the detail is in the textures, and the model itself, the part you would 3D print, isn't so impressive.

jayd16
3 replies
1d1h

You know, I do wonder about this. If it's just for static assets, does it really matter? In something like Unreal, the textures are going to be virtualized and the geometry is going to be turned into LODed triangle soup anyway.

Has anyone tried to build an Unreal scene with these generated meshes?

jsheard
2 replies
1d1h

Usually the problem is the model itself is severely lacking in detail, sure Nanite could make light work of a poorly optimized model but it's not going to fix the model being a vague blob which doesn't hold up to close scrutiny.

kaibee
0 replies
1d1h

Generate the accompanying normal map and then just tessellate it?

andybak
0 replies
20h33m

So don't use them in a context where they require close scrutiny?

yazzku
2 replies
1d2h

I was going to make the same comment: these 3D reconstructions often generate a mess of a topology, and this post does not show any of the mesh triangulations, so I assume they're still not good. Arguably, the meshes are bad even for rendering.

dlivingston
1 replies
1d2h

Presumably, these meshes can be cleaned up using standard mesh refinement algorithms, like those found in MeshLab: https://www.meshlab.net/#features
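
MeshLab's filters can also be scripted via pymeshlab, so the cleanup could in principle be batched. A rough sketch (the input path is a placeholder, and filter names differ between pymeshlab versions, so check the filter list for yours):

    import pymeshlab

    ms = pymeshlab.MeshSet()
    # Placeholder path; convert the generated .glb to .obj/.ply first if needed.
    ms.load_new_mesh("generated_asset.obj")

    # Basic cleanup plus decimation. These filter names are from recent
    # pymeshlab releases and may be spelled differently in older ones.
    ms.apply_filter("meshing_remove_duplicate_vertices")
    ms.apply_filter("meshing_remove_unreferenced_vertices")
    ms.apply_filter("meshing_decimation_quadric_edge_collapse", targetfacenum=5000)

    ms.save_current_mesh("cleaned_asset.obj")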

Keyframe
0 replies
1d1h

Hopefully that's in the (near) future, but as of now there still exists 'retopo' in 3D work for a reason. Just like roto and similar menial tasks. We're getting there with automation though.

mft_
0 replies
1d2h

You can download a .glb file (from the HuggingFace demo page) and open it locally (e.g. in MS 3D Viewer). I'm looking at a mesh from one of the better examples I tried and it's actually pretty good...
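
If you'd rather inspect it programmatically than in a viewer, trimesh loads the .glb directly; a quick sketch (the filename is whatever you downloaded):

    import trimesh

    # force="mesh" flattens the glTF scene into a single mesh object.
    mesh = trimesh.load("stable_fast_3d_output.glb", force="mesh")

    print("vertices:  ", len(mesh.vertices))
    print("faces:     ", len(mesh.faces))
    print("watertight:", mesh.is_watertight)

    # Opens a simple interactive viewer (needs pyglet installed).
    mesh.show()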

fragmede
0 replies
1d

hueforge

bloopernova
5 replies
1d2h

Closer and closer to the automatic mapping drones from Prometheus.

I wonder what the optimum group of technologies is that would enable that kind of mapping? Would you pile on LIDAR, RADAR, this tech, ultrasound, magnetic sensing, etc etc. Although, you're then getting a flying tricorder. Which could enable some cool uses even outside the stereotypical search and rescue.

nycdatasci
1 replies
1d2h

High-res images from multiple perspectives should be sufficient. If you have a consumer drone, this product (no affiliation) is extremely impressive: https://www.dronedeploy.com/

You basically select an area on a map that you want to model in 3d, it flies your drone (take-off, flight path, landing), takes pictures, uploads to their servers for processing, generates point cloud, etc. Very powerful.

thetoon
0 replies
1d2h

What you could do with WebODM is already quite impressive

pzo
0 replies
1d

You already have Depth Anything V2, which can generate depth maps in real time even on an iPhone. Quality is pretty good and will probably improve further. Actually, in many ways those depth maps are much better quality than the iPhone's LiDAR or TrueDepth camera (which can't handle transparent, metallic, or reflective surfaces, and are quite noisy).

https://github.com/DepthAnything/Depth-Anything-V2

https://huggingface.co/spaces/pablovela5620/depth-compare

https://huggingface.co/apple/coreml-depth-anything-v2-small
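
For anyone who wants to try it outside the iPhone demo, the transformers depth-estimation pipeline wraps it in a few lines. A rough sketch (I believe the model id below is the small V2 checkpoint on Hugging Face; the input filename is a placeholder):

    from transformers import pipeline
    from PIL import Image

    # Small V2 checkpoint; larger variants exist at the cost of speed.
    depth_estimator = pipeline(
        "depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

    image = Image.open("room_photo.jpg")
    result = depth_estimator(image)

    # result["depth"] is a PIL image; result["predicted_depth"] is the raw tensor.
    result["depth"].save("room_depth.png")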

alsodumb
0 replies
1d2h

Are you talking about mapping tunnels with drones? That's already done and it doesn't really need any 'AI': it's plain old SLAM.

DARPA's subterranean challenge had many teams that did some pretty cool stuff in this direction: https://spectrum.ieee.org/darpa-subterranean-challenge-26571...

woolion
4 replies
1d1h

This is the third image to 3D AI I've tested, and in all cases the examples they give look like 2D renders of 3D models already. My tests were with cel-shaded images (cartoony, not with realistic lighting) and the model outputs something very flat but with very bad topology, which is worse than starting with a low poly or extruding the drawing. I suspect it is unable to give decent results without accurate shadows from which the normal vectors could be recomputed and thus lacks any 'understanding' of what the structure would be from the lines and forms.

In any case it would be cool if they specified the set of inputs that is expected to give decent results.

quitit
2 replies
1d1h

It might not just be your tests.

All of my tests of img2mesh technologies have produced poor results, even when using images that are very similar to the ones featured in their demo. I’ve never got fidelity like what they’ve shown.

I’ll give this a whirl and see if it performs better.

woolion
0 replies
22h37m

All right, I was hesitating to try shading some images to see if that improves the quality. It's probably still too early.

quitit
0 replies
23h23m

Tried it with a collection of images, and in my opinion it performs -worse- than earlier releases.

It is however fast.

diggan
0 replies
23h26m

What stuck out to me from this release was this:

Optional quad or triangle remeshing (adding only 100-200ms to processing time)

But it seems to have been optional. Did you try it with that turned on? I'd be very interested in those results, as I had the same experience as you, the models don't generate good enough meshes, so was hoping this one would be a bit better at that.

Edit: I just tried it out myself on their Huggingface demo and even with the predefined images they have there, the mesh output is just not good enough. https://i.imgur.com/e6voLi6.png

puppycodes
2 replies
1d

I really can't wait for this technology to improve. Unfortunately just from testing this it seems not very useful. It takes more work to modify the bad model it approximates from the image output than starting with a good foundation from scratch. I would rather see something that took a series of steps to reach a higher quality end product more slowly instead of expecting everything to come from one image. Perhaps i'm missing the use case?

andybak
0 replies
20h34m

not very useful

Useful for what? I think use cases will emerge.

A lot of critiques assume you're working in VFX or game development. Once image-to-3D (and by extension text-to-image-to-3D) becomes effortless, a whole host of new applications opens up, many of which might not be anywhere near so demanding.

MrTrvp
0 replies
23h31m

Perhaps it'll require a series of segmentations and transforms that improve individual components and then work up towards the full 3D model of the image.

ww520
1 replies
1d1h

This is a great step forward.

I wonder whether RAG-based 3D animation generation can be done with this (rough sketch after the list):

1. Textual description of a story.

2. Extract/generate keywords from the story using LLM.

3. Search and look up 2D images by the keywords.

4. Generate 3D models from the 2D images using Stable Fast 3D.

5. Extract/generate path description from the story using LLM.

6. Generate movement/animation/gait using some AI.

...

7. Profit??
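
Roughly, as an orchestration sketch (every helper below is a hypothetical stub standing in for the LLM calls, image search, Stable Fast 3D, and the animation model; none of them are real APIs):

    # Hypothetical stubs only; wire in real models/services where noted.

    def extract_keywords(story: str) -> list:         # step 2: LLM call
        return ["knight", "dragon", "castle"]          # stub output

    def search_image(keyword: str) -> str:             # step 3: image lookup
        return f"{keyword}.png"                        # stub: path to a 2D image

    def image_to_3d(image_path: str) -> str:           # step 4: Stable Fast 3D
        return image_path.replace(".png", ".glb")      # stub: path to a mesh

    def extract_motion(story: str) -> dict:            # step 5: LLM call
        return {"knight": "walk to the castle gate"}   # stub

    def animate(mesh_path: str, motion: str) -> str:   # step 6: animation model
        return mesh_path.replace(".glb", "_animated.glb")

    def story_to_animation(story: str) -> list:
        motions = extract_motion(story)
        clips = []
        for kw in extract_keywords(story):
            mesh = image_to_3d(search_image(kw))
            clips.append(animate(mesh, motions.get(kw, "idle")))
        return clips

    print(story_to_animation("A knight rides toward a dragon-guarded castle."))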

nwoli
0 replies
1d1h

Pre-generate a bunch of images via SDXL, convert them to 3D, and then serve the nearest mesh after querying.
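
The nearest-mesh lookup could be something like CLIP embeddings over the pre-generated images; a rough sketch with sentence-transformers (image and mesh filenames are placeholders):

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    # CLIP maps images and text into the same embedding space.
    model = SentenceTransformer("clip-ViT-B-32")

    # Offline: embed renders of the pre-generated meshes (placeholder filenames).
    catalog = ["chair.png", "sword.png", "tree.png"]
    catalog_emb = model.encode([Image.open(p) for p in catalog])

    # Online: embed the text query and return the closest pre-built mesh.
    query_emb = model.encode(["a wooden chair"])
    scores = util.cos_sim(query_emb, catalog_emb)[0]
    best = catalog[int(scores.argmax())]
    print("serve mesh for:", best.replace(".png", ".glb"))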

kleiba
1 replies
1d2h

This is good news for the indie game dev scene, I suppose?

jayd16
0 replies
1d1h

The models aren't really optimized for game dev. Fine for machinima, probably.

voidUpdate
0 replies
11h25m

What I'd really like to see in these kinds of articles are examples of it not working as well. I don't necessarily want to see it being perfect; I'd quite like to see its limitations too.

talldayo
0 replies
1d3h

0.5 seconds per 3D asset generation on a GPU with 7GB VRAM

Holy cow - I was thinking this might be one of those datacenter-only models but here I am proven wrong. 7GB of VRAM suggests this could run on a lot of hardware that 3D artists own already.

specproc
0 replies
1d2h

Be still my miniature-painting heart.

quantumwoke
0 replies
1d2h

Great result. Just had a play around with the demo models and they preserve structure really nicely, although the textures are still not great. It's kind of a voxelized version of the input image.

nextworddev
0 replies
1d1h

For those reading from Stability - just tried it - API seems to be down and the notebook doesn't have the example code it claimed to have.

mft_
0 replies
1d2h

I'm really excited for something in this area to really deliver, and it's really cool that I can just drag pictures into the demo on HuggingFace [0] to try it.

However... mixed success. It's not good with (real) cats yet - which was obvs the first thing I tried. It did reasonably well with a simple image of an iPhone, and actually pretty impressively with a pancake with fruit on top, terribly with a rocket, and impressively again with a rack of pool balls.

[0] https://huggingface.co/spaces/stabilityai/stable-fast-3d
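
If dragging images in gets tedious, the Space can also be driven from Python with gradio_client; a sketch (I haven't checked the Space's endpoint names, so list them with view_api() first):

    from gradio_client import Client

    client = Client("stabilityai/stable-fast-3d")

    # Print the Space's available endpoints and their parameters.
    client.view_api()

    # Then call the generation endpoint it reports. The api_name and arguments
    # below are placeholders; take the real ones from the view_api() output.
    # result = client.predict("cat_photo.png", api_name="/run")
    # print(result)  # typically a local path to the returned .glb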

ksec
0 replies
1d

Given that the graphics-asset part of AA or AAA games is the most expensive, I wonder if 3D asset generation could drastically lower that cost by 50% or more? At least for the same output. Because in reality, I guess artists will just spend more time in other areas.

fsloth
0 replies
22h58m

Not the holy grail yet, but pretty cool!

I see these usable not as main assets, but as something you would add as a low-effort embellishment to add complexity to the main scene. The fact that they maintain their profile makes them usable in situations where a mere 2D billboard impostor (i.e. the original image always oriented towards the camera) would not cut it.

You can totally create a figure image (Midjourney|Bing|Dalle3), drag and drop it into the image input, and get a surprisingly good 3D representation; not a highly detailed model, but something you could very well put on a shelf in a 3D scene as an embellishment, where the camera never sees the back of it and the model is never at the center of attention.

causi
0 replies
1d

Man, it would be so cool to get AI-assisted photogrammetry. Imagine that instead of taking a hundred photos or a slow scan and having to labor over a point cloud, you could just take like three pictures and then go down a checklist: "Is this circular? How long is this straight line? Is this surface flat? What's the angle between these two pieces?" and get a perfect replica or even a STEP file out of it. Heaven for 3D printers.

Y_Y
0 replies
1d1h

It really looks like they've been doing that classic infomercial tactic of desaturating the images of the things they're comparing against to make theirs seem better.