For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI. It hits the sweet spot for the fundamental limitations of the methods:
* so-called "hallucination" (actually just how generative models work) is a feature, not a bug.
* anyone can easily see the unrealistic and biased outputs without complex statistical tests.
* human intuition is useful for evaluation, and not fundamentally misleading (i.e. the equivalent of "this text sounds fluent, so the generator must be intelligent!" hype doesn't really exist for imagery. We're capable of treating it as technology and evaluating it fairly, because there's no equivalent human capability.)
* even lossy, noisy, collapsed and over-trained methods can be valuable for different creative pursuits.
* perfection is not required. You can easily see distorted features in output, and iteratively try to improve them.
* consistency is not required (though it will unlock hugely valuable applications, like video, should it ever arrive).
* technologies like LoRA allow even unskilled users to train character-, style- or concept-specific models with ease.
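To make that last point concrete, here's a minimal sketch of what attaching LoRA adapters to a diffusion model looks like with the Hugging Face diffusers and peft libraries (the base model and hyperparameters are just illustrative, not a recommendation):

```python
# Minimal LoRA sketch: attach low-rank adapters to a Stable Diffusion UNet so that
# only a tiny fraction of parameters needs training for a new character/style/concept.
# Assumes the `diffusers` and `peft` packages; model name and hyperparameters are illustrative.
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=8,                    # low-rank dimension: small, so training is cheap
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections in the UNet
)
pipe.unet = get_peft_model(pipe.unet, lora_config)
pipe.unet.print_trainable_parameters()  # typically well under 1% of the UNet's weights

# A short training loop over a handful of example images would go here; at inference
# time the trained adapter is simply loaded on top of the frozen base model.
```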
I've been amazed at how much better image / visual generation models have become in the last year, and IMO, the pace of improvement has not been slowing as much as it has for text models. Moreover, it's becoming increasingly clear that the future isn't the wholesale replacement of photographers, cinematographers, etc., but rather, a generation of crazy AI-based power tools that can do things like add and remove concepts in imagery with a few text prompts. It's insanely useful, and just like Photoshop in the 90s, a new generation of power users is already emerging and doing wild things with the tools.
I am biased (I work at Rev.com and Rev.ai), but I totally agree and would add one more thing: transcription. Accurate human transcription takes a really, really long time to do right. Often a ratio of 3:1-10:1 of transcriptionist time to original audio length.
Though ASR is only ~90-95% accurate on a lot of "average" audio, it is often 100% accurate on high-quality audio.
It's not only a cost savings thing, but there are entire industries that are popping up around AI transcription that just weren't possible before with human speed and scale.
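To put a number like "90-95% accurate" in concrete terms: ASR accuracy is usually reported as word error rate (WER). A quick sketch using the jiwer package (the sample strings are made up):

```python
# Word error rate (WER) = (substitutions + deletions + insertions) / words in the reference.
# "~90-95% accurate" corresponds roughly to a WER of 5-10%.
import jiwer

reference  = "accurate human transcription takes a really long time to do right"
hypothesis = "accurate human transcription takes a long time to do it right"

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # one deletion + one insertion over 11 words -> ~18%
```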
Also the other way around: text to speech. We're at the point where I can finally listen to computer generated voice for extended periods of time without fatigue.
There was a project mentioned here on HN where someone was creating audio book versions of content in the public domain that would never have been converted through the time and expense of human narrators because it wouldn't be economically feasible. That's a huge win for accessibility. Screen readers are also about to get dramatically better.
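As a rough illustration of how low the barrier has gotten, here's a minimal sketch of the public-domain-text-to-audiobook idea using the open-source Coqui TTS package (the model name is illustrative, and voice quality varies a lot between models):

```python
# Turn a chapter of public-domain text into a narrated audio file.
# Assumes the Coqui `TTS` package; the model name is illustrative.
from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")

chapter_text = open("chapter_01.txt").read()  # e.g. a Project Gutenberg chapter
tts.tts_to_file(text=chapter_text, file_path="chapter_01.wav")
```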
Maybe this: https://news.ycombinator.com/item?id=40961385
That's the one! Thanks!
I’d add image to text - I use this all the time. For instance I’ll take a photo of a board or device, and ChatGPT/Claude/pick your frontier multimodal model is almost always able to classify it accurately and describe details, including chipsets, pinouts, etc.
Are there any models that can do diarization well yet?
I need one for a product, and the state of the art, e.g. pyannote, is so bad it's better not to use it.
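For reference, this is roughly what trying the off-the-shelf pyannote pipeline looks like (the model name and token handling follow their documentation; results depend heavily on audio quality):

```python
# Off-the-shelf speaker diarization with pyannote.audio.
# Requires accepting the model's terms on Hugging Face and an access token (placeholder below).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # placeholder token
)

diarization = pipeline("meeting.wav")  # local audio file

# Each turn is a (start, end) segment attributed to an anonymous speaker label.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```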
Deepgram has been pretty good for our product. Fast and fairly accurate for English.
Do they have a local model?
I keep getting burned by APIs having stupid restrictions that makes use cases impossible that are trivial if you can run the thing locally.
I agree. I think it's more of a niche use-case than image models (and fundamentally harder to evaluate), but transcription and summarization is my current front-runner for winning use-case of LLMs.
That said, "hallucination" is more of a fundamental problem for this area than it is for imagery, which is why I still think imagery is the most interesting category.
German public television already switched to automatic transcriptions a few years back.
I'd refrain from making any such statements about the future;* the pace of change makes it hard to see the horizon beyond a few years, especially relative to the span of a career. It's already wholesale-replacing many digital artists and editorial illustrators, and while it's still early, there's a clear push starting in the cinematography direction. (I fully agree with the rest of your comment, and it's strange how much diffusion models seem to be overlooked relative to LLMs when people think about AI progress these days.)
* (edit: about the future impact of AI on jobs).
I mean, my whole comment is a prediction of the future, so that's water under the bridge. Maybe you're right and this is the start of the apocalypse for digital artists, but it feels more like photoshop in 1990 to me -- and people were saying the same stuff back then.
I think you're going to need to cite some data on a claim like that. Maybe it's replacing the fiverr end of the market? It's certainly much harder to justify paying someone to generate a (bad) logo or graphic when a diffusion model can do the same thing, but there's no way that a model, today, can replace a skilled artist. Or said differently: a skilled artist, combined with a good AI model, is vastly more productive than an unskilled artist with the same model.
What happens when the AI takes the low end of the market is that the people who catered to the low end now have to try to compete more in the mid-to-high end. The mid end facing increased competition has to try to move up to the high end. So while AI may not be able to compete directly with the high end it will erode the negotiating power and thus the earning potential of the high end.
We have watched this same process repeat a few times over the last century with photography.
Or graphic design, or video editing, or audio mastering, or...every new tool has come with a bunch of people saying things like "what will happen to the linotype operators!?"
I sort of hate this line of argument, but it also has been manifestly true of the past, and rhymes with the present.
Pay 10 unskilled artists to do a bad job and we will complain about 10 bad logos. Now, for a fraction of the price, generate 10,000 low-quality AI logos and flood the market with them. Market expectations will drop, and suddenly your AI will be on par with the artists...
(in case you think the market will not behave like that, just have a look at how we produce low quality food and how many people are perfectly fine with that)...
Today an engineer does the job of 100, thanks to computers.
Let me show you the future: https://www.youtube.com/watch?v=eVlXZKGuaiE
This is an LLM controlling an embodied VR body in a physics simulation.
It is responding to human voice input not only with voice but body movements.
Transformers aren't just chatbots, they are general symbolic manipulation machines. Anything that can be expressed as a series of symbols is a thing they can do.
No, it's not. It's VAM that is controlling the character: it's literally just using a bog-standard LLM as a chatbot and feeding the text into a plugin in VAM, and VAM itself does the animation. Don't get me wrong, it's absolutely next level to experience chatbots this way, but it's still a chatbot.
The animation, not the movement decisions.
This is as naive as calling an industrial robot 'just a calculator'.
The movement decisions are also just text from the LLM and are heavily coupled with what's available in the scene. It's not some free autonomous agent. Nor were the movement decisions trained on any special type of tokens, just text.
Yes and?
I would argue the opposite — image generation is the clear loser. If you've ever tried to do it yourself, grabbing a bunch of LoRAs from Civitai to try to convince a model to draw something it doesn't initially know how to draw — it becomes clear that there's far too much unavoidable correlation between "form" and "representation" / "style" going on in even a SOTA diffusion model's hidden layers.
Unlike LLMs, which really seem to translate the text into "concepts" at a certain embedding layer, the (current, 2D) diffusion models will store (and thus require to be trained on) a completely different idea of a thing if it's viewed from a slightly different angle, or is a different size. Diffusion models can interpolate but not extrapolate — they can't see a prompt that says "lion goat dragon monster" and come up with the ancient-Greek Chimera, unless they've actually been trained on a Chimera. You can tell them "asian man, blond hair" — and if their training dataset contains asian men and men with blond hair but never at the same time, then they won't be able to "hallucinate" a blond asian man for you, because that won't be an established point in the model's latent space.
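If you want to probe that claim yourself, here's a quick sketch with an off-the-shelf pipeline (the model and prompts are just illustrative; whether the combined prompt works depends entirely on the training data):

```python
# Probe whether a text-to-image model can compose attributes it may not have seen together.
# Assumes Hugging Face `diffusers`; the model choice and prompts are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "portrait photo of an asian man",
    "portrait photo of a man with blond hair",
    "portrait photo of an asian man with blond hair",  # the compositional case
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
    image.save(f"probe_{i}.png")
# If the third image collapses to one attribute or the other, that's the
# interpolate-but-not-extrapolate failure mode described above.
```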
---
On a tangent: IMHO the true breakthrough would be a model for "text to textured-3D-mesh" — where it builds the model out of parts that it shapes individually and assembles in 3D space not out of tris, but by writing/manipulating tokens representing shader code (i.e. it creates "procedural art"); and then it consistency-checks itself at each step not just against a textual embedding, but also against an arbitrary (i.e. controlled for each layer at runtime by data) set of 2D projections that can be decoded out to textual embeddings.
(I imagine that such a model would need some internal "blackboard" of representational memory that it can set up arbitrarily-complex "lenses" for between each layer — i.e. a camera with an arbitrary projection matrix, through which is read/written a memory matrix. This would allow the model to arbitrarily re-project its internal working visual "conception" of the model between each step, in a way controllable by the output of each step. Just like a human would rotate and zoom a 3D model while working on it[1]. But (presumably) with all the edits needing a particular perspective done in parallel on the first layer where that perspective is locked in.)
Until we have something like that, though, all we're really getting from current {text,image}-to-{image,video} models is the parallel layered inpainting of a decently, but not remarkably exhaustive pre-styled patch library, with each patch of each layer being applied with an arbitrary Photoshop-like "layer effect" (convolution kernel.) Which is the big reason that artists get mad at AI for "stealing their work" — but also why the results just aren't very flexible. Don't have a patch of a person's ear with a big earlobe seen in profile? No big-earlobe ear in profile for you. It either becomes a small-earlobe ear or the whole image becomes not-in-profile. (Which is an improvement from earlier models, where just the ear became not-in-profile.)
[1] Or just like our minds are known to rotate and zoom objects in our "spatial memory" to snap them into our mental visual schemas!
The kind of granular, human-assisted interaction interface and workflow you're describing is, IMHO, the high-value path for the evolution of AI creative tools for non-text applications such as imaging, video and music, etc. Using a single or handful of images or clips as a starting place is good but as a semi-talented, life-long aspirational creative, current AI generation isn't that practically useful to me without the ability to interactively guide the AI toward what I want in more granular ways.
Ideally, I'd like an interaction model akin to real-time collaboration. Due to my semi-talent, I've often done initial concepts myself and then worked with more technically proficient artists, modelers, musicians and sound designers to achieve my desired end result. By far the most valuable such collaborations weren't necessarily with the most technically proficient implementers, but rather those who had the most evolved real-time collaboration skills. The 'soft skill' of interpreting my directional inputs and then interactively refining or extrapolating them into new options or creative combinations proved simply invaluable.
For example, with graphic artists I've developed a strong preference for working with those able to start out by collaboratively sketching rough ideas on paper in real-time before moving to digital implementation. The interaction and rapid iteration of tossing evolving ideas back and forth tended to yield vastly superior creative results. While I don't expect AI-assisted creative tools to reach anywhere near the same interaction fluidity as a collaboratively-gifted human anytime soon, even minor steps in this direction will make such tools far more useful for concepting and creative exploration.
...but I wasn't describing a "human-assisted interaction interface and workflow." I was describing a different way for an AI to do things "inside its head" in a feed-forward span-of-a-few-seconds inference pass.
Thanks for the correction. Not being well-versed in AI tech, I misinterpreted what you wrote and assumed it might enable more granular feedback and iteration.
I think you’re arguing about slightly different things. OP said that image generation is useful despite all its shortcomings, and that the shortcomings are easy to deal with for humans. OP didn’t argue that the image generation AIs are actually smart. Just that they are useful tech for a variety of use cases.
This is key: we’re all pre-wired with fast correctness tests.
Are there other data types that match this?
Software (I mean the product, not the code)
Mundane tasks that can be visually inspected at the end (cleaning, organizing, maintenance and mechanical work)
Audio to a lesser degree
Honestly, I have yet to see an AI-generated image that makes me go "oh wow". It's missing those last 10 percent that always seem to elude neural networks.
Also, the very bad press gen AI gets is very much slowing down adoption. Particularly among the creative-minded people, who would be the most likely users.
Hop on civitai
There's plenty of mindblowing images
LLMs are a breakthrough for the human-to-computer interface.
The knowledge-answering part is secondary, in my opinion.
I think it's easy to totally miss that LLMs are just being completely and quietly subsumed into a ton of products. They have been far more successful, and many image generation models use LLMs on the backend to generate "better" prompts for the models themselves. LLMs are the bedrock.
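A minimal sketch of that LLM-as-prompt-rewriter pattern, using the OpenAI Python client as a stand-in (the model names and system prompt are illustrative, not what any particular product actually does):

```python
# Expand a terse user prompt into a detailed image prompt with an LLM,
# then hand the result to an image model. Model names are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

user_prompt = "a cozy cabin in the woods at night"

expanded = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Rewrite the user's idea as a detailed, concrete "
                                      "image-generation prompt: lighting, composition, lens, mood."},
        {"role": "user", "content": user_prompt},
    ],
).choices[0].message.content

image = client.images.generate(model="dall-e-3", prompt=expanded, size="1024x1024")
print(image.data[0].url)
```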
I agree, but I'm a bit biased: our start-up www.sticky.study is in this space.
What we've seen over the last year, trying out dozens of models and AI workflows, is that the fit between 1) a model's error tolerance and 2) its working context is super important.
AI hallucinations break a lot of otherwise useful implementations. It's just not trustworthy enough. Even with AI imagery, some use cases require precision - AI photoshoots and brand advertising come to mind.
The sweet spot seems to be as part of a pipeline where the user only needs a 90% quality output. Or you have a human + computer workflow - a type of "Centaur" - similar to Moravec's Paradox.
Image models are a great way to understand generative AI. It's like surveying a battlefield from the air as opposed to the ground.