
Show HN: Infinity – Realistic AI characters that can speak

xpe
30 replies
16h26m

Given that I don't agree with many of Yann LeCun's stances on AI, I enjoyed making this:

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Hello I'm an AI-generated version of Yann LeCoon. As an unbiased expert, I'm not worried about AI. ... If somehow an AI gets out of control ... it will be my good AI against your bad AI. ... After all, what does history show us about technology-fueled conflicts among petty, self-interested humans?

pmarreck
29 replies
13h20m

it’s hard to disagree with him on any empirical basis when all of his statements seem empirically sound and all of his opponents’ AI Doomer statements seem like evidenceless FUD

I couldn’t help noticing that all the AI Doomer folks are pure materialists who think that consciousness and will can be completely encoded in cause-and-effect atomic relationships. The real problem is that that belief is BS until proven true. And as long as there are more good actors than bad, and AI remains just a sophisticated tool, the good effects will always outweigh the bad effects.

TeMPOraL
15 replies
10h54m

consciousness and will can be completely encoded in cause-and-effect atomic relationships. The real problem is that that belief is BS until proven true.

Wait. Isn't it literally the exact other way around? Materialism is the null hypothesis here, backed by all empirical evidence to date; it's all the other hypotheses presenting some kind of magic that are BS until proven true.

h_tbob
6 replies
9h45m

A wise philosopher once said this.

You know your experience is real. But you do not know if the material world you see is the result of a great delusion by a master programmer.

Thus the only thing you truly know has no mass at all. Thus a wise person takes the immaterial as immediately apparent, but the physical as questionable.

You can always prove the immaterial “I think therefore I am”. But due to the uncertainty of matter, nothing physical can be truly known. In other words you could always be wrong in your perception.

So in sum, your experience has no mass, volume, or width. There are no physical properties at all to consciousness. Yet it is the only thing that we can know exists.

Weird, huh?

ziofill
0 replies
9h38m

Reminds me of Donald Hoffman’s perspective on consciousness

xpe
0 replies
6h56m

Philosophy as a field has been slow to take probability theory seriously. Trying to traffic in only certainty is a severe limitation.

scotty79
0 replies
8h29m

You can always prove the immaterial “I think therefore I am”. But due to the uncertainty of matter, nothing physical can be truly known.

But the brain that does the proving of the immaterial is itself material, so if matter is uncertain, then the reasoning behind the proof of the immaterial can also be flawed; thus you can't prove anything.

The only provable thing is that philosophers ask themselves useless questions, think about them long and hard while building up convoluted narratives they claim to be proofs, but along the way they assume something stupid to move forward, which eventually leads to bogus "insights".

nkrisc
0 replies
8h8m

Yet empirically we know that if you physically disassemble the human brain, that person’s consciousness apparently ceases to exist, as observed by the effect on the rest of the body even if it remains otherwise intact. So it appears to arise from some physical properties of the brain.

I’m ignoring the argument that we can’t know if anything we perceive is even real at all, since it’s unprovable and useless to consider. Better to just assume it’s wrong. And if that assumption is wrong, then it doesn’t matter.

TeMPOraL
0 replies
7h27m

You can always prove the immaterial “I think therefore I am”. But due to the uncertainty of matter, nothing physical can be truly known. In other words you could always be wrong in your perception.

Sure, you can prove that "I think therefore I am" for yourself. So how about we just accept it's true and put it behind us and continue to the more interesting stuff?

What you or I call external world, or our perception of it, has some kind of structure. There are patterns to it, and each of us seem to have some control over details of our respective perceptions. Long story short, so far it seems that materialism is the simplest framework you can use to accurately predict and control those perceptions. You and I both seem to be getting most mileage out of assuming that we're similar entities inhabiting and perceiving a shared universe that's external to us, and that that universe follows some universal patterns.

That's not materialism[0] yet, especially not in the sense relevant to AI/AGI. To get there, one has to learn about the existence of fields of study like medicine, or neuroscience, and some of the practical results they yielded. Things like, you poke someone's brain with a stick, watch what happens, and talk to the person afterwards. We've done that enough times to be fairly confident that a) brain is the substrate in which mind exists, and b) mind is a computational phenomenon.

I mean, you could maybe question materialism 100 years ago, back when people had the basics of science down but not much data to go on. It's weird to do in this day and age, when you can literally circuit-bend a brain like you'd do with an electronic toy, and get the same kind of result from the process.

--

[0] - Or physicalism or whatever you call the "materialism, but updated to current state of physics textbooks" philosophy.

Chance-Device
0 replies
8h16m

Descartes. And it’s pretty clear that consciousness is the Noumenon, just the part of it that is us. So if you want to know what the ontology of matter is, congratulations, you’re it.

dmd
6 replies
8h19m

While I agree 100% with you, everyone thinks that way about their own belief.

TeMPOraL
4 replies
7h45m

So let's put it differently.

True or not, materialism is the simplest, most constrained, and most predictive of the hypotheses that match available evidence. Why should we prefer a "physics + $magic" theory, for any particular flavor of $magic? Why this particular flavor? Why any flavor, if so far everything is explainable by the baseline "physics" alone?

Even in purely practical terms, it makes most sense to stick to materialism (at least if you're trying to understand the world; for control over people, the best theory needs not even be coherent, much less correct).

dmd
3 replies
7h30m

But the religious nuts will say "no, 'god did it' is the simplest, most constrained explanation".

I'm not arguing that they're correct. I'm saying that they believe that they are correct, and if you argue that they're not, well, you're back to arguing!

It's the old saw - you can't reason someone out of a position they didn't reason themself into.

TeMPOraL
2 replies
7h12m

But the religious nuts will say "no, 'god did it' is the simplest, most constrained explanation".

Maybe, but then we can still get to common ground by discussing a hypothetical universe that looks just like ours but happens to not have a god inside (or lost it along the way). In that hypothetical universe, similar to yet totally not ours, ruled purely by math, things would happen in a particular way; in that universe, materialism is the simplest explanation.

(It's up to religious folks then to explain where that hypothetical universe diverges from the real one specifically, and why, and how confident they are of that.)

dmd
1 replies
7h10m

You've never actually met a religious person, have you. :)

TeMPOraL
0 replies
7h7m

I used to be one myself :).

I do of course exclude people, religious or otherwise, who have no interest or capacity to process a discussion like this. We don't need 100% participation of humanity to discuss questions about what an artificial intelligence could be or be able to do.

xanderlewis
0 replies
8h11m

Yeah. One could equally imagine that dualism is the null hypothesis since human cultures around the world have seemingly universally believed in a ‘soul’ and that materialism is only a very recent phenomenon.

Of course, (widespread adoption of) science is also a fairly recent phenomenon, so perhaps we do know more now than we did back then.

pmarreck
0 replies
1h43m

You're right. Materialism IS the null hypothesis. And yet I know in my heart that its explanatory power is limited unless you want to write off all value, preference, feeling and meaning as "illusion", which amounts to gaslighting.

What if the reverse is true? The only real thing is actually irrationality, and all the rational materialism is simply a catalyst for experiencing things?

The answer to this great question has massive implications, not just in this realm, btw. For example, crime and punishment. Why are we torturing prisoners in prison who were just following their programming?

ben_w
4 replies
10h56m

To me, it seems like LeCun is missing the point of the (many and diverse) doom arguments.

There is no need for consciousness, there is only a need for a bug. It was purely luck that Nikita Khrushchev was in New York when Thule Site J mistook the moon for a Soviet attack force.

There is no need for AI to seize power, humans will promote any given AI beyond the competency of that AI just as they already do with fellow humans ("the Peter principle").

The relative number of good and bad actors — even if we could agree on what that even meant, which we can't, especially with commons issues, iterated prisoners' dilemmas, and other similar Nash equilibria — doesn't help either way when the AI isn't aligned with the user.

(You may ask what I mean by "alignment", and in this case I mean vector cosine similarity "how closely will it do what the user really wants it to do, rather than what the creator of the AI wants, or what nobody at all wants because it's buggy?")
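(To make that cosine-similarity framing concrete, a toy sketch in Python; the embedding vectors for "what the user wanted" and "what the AI did" are purely hypothetical, not anything from a real system:)

    import numpy as np

    def alignment(intended, actual):
        """Cosine similarity in [-1, 1]; 1.0 means the AI did exactly what was wanted."""
        return float(np.dot(intended, actual) /
                     (np.linalg.norm(intended) * np.linalg.norm(actual)))

    intended = np.array([1.0, 0.2, 0.0])  # hypothetical embedding of what the user wanted
    actual = np.array([0.9, 0.1, 0.3])    # hypothetical embedding of what the AI actually did
    print(alignment(intended, actual))    # ~0.95: mostly, but not perfectly, aligned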

But even then, AI compute is proportional to how much money you have, so it's not a democratic battle, it's an oligarchic battle.

And even then, reality keeps demonstrating the incorrectness of the saying "the only way to stop a bad guy with a gun is a good guy with a gun": it's much easier to harm and destroy than to heal and build.

And that's without anyone needing to reach for "consciousness in the machines" (whichever of the 40-something definitions of "consciousness" you prefer).

Likewise it doesn't need plausible-future-but-not-yet-demonstrated things like "engineering a pandemic" or "those humanoid robots in the news right now, could we use them as the entire workforce in a factory to make more of them?"

ziofill
0 replies
9h30m

I agree. Also, I’ve heard LeCun arguing that a super intelligent AI wouldn’t be so “dumb” as to decide to do something terrible for humanity. So it will be 100% resistant to adversarial attacks? And malicious actors won’t ever train their own? And even if we don’t go all the way to super intelligence, is it riskless to progressively yield control to AIs?

xpe
0 replies
6h53m

Missing the point is a nice way of putting it. LeCun’s interests and position incline him to miss the point.

Personally, I view his takes on AI as unserious — in the sense that I have a hard time believing he really engages in a serious manner. The flaws of motivated reasoning and “early-stopping” are painfully obvious.

scotty79
0 replies
8h23m

Details are fun but the dilemma is: should humanity seriously cripple itself (by avoiding AI) out of the fear it might hurt itself (with AI)? Are you gonna cut off your arm because you might hit yourself in the face with it in the future? The more useful the tool is, the more dangerous it usually is. Should we have killed all nuclear physicists before they figured out how to release nuclear energy? And even so.. would that prevent things or just delay things? Would it make us more or less prepared for the things to come?

pmarreck
0 replies
1h14m

Good points, and I prefer this version of the "AI Doomer" argument to the more FUD-infused ones.

One note: "It was purely luck that Nikita Khrushchev was in New York when Thule Site J mistook the moon for a Soviet attack force." I cannot verify this story. (Ironically, I not only googled but consulted two different AIs, the brand-new "Reflection" model (which is quite impressive) as well as OpenAI's GPT4o... they both say that the Thule moon false alarm occurred a year after Khrushchev's visit to New York.) Point noted, though.

MattHeard
3 replies
12h12m

Hello, thank you for sharing your thoughts on this topic. I'm currently writing a blog post where the thesis is that the root disagreement between "AI doomers" and others is actually primarily a disagreement about materialism, and I've been looking for evidence of this disagreement in the wild. Thanks for sharing your opinion.

xpe
1 replies
7h1m

Really? You sound serious. I would recommend rethinking such a claim. There are simpler and more plausible explanations for why people view existential risk differently.

pmarreck
0 replies
1h35m

What are those? Because the risk is far higher if you believe that "will" is fundamentally materialist in nature. Those of us who do not (for whatever reason) do not evaluate this risk nearly as highly.

It is difficult to prove an irrational thing with rationality. How do we know that you and I see the same color orange (this is the concept of https://en.wikipedia.org/wiki/Qualia)? Measuring the wavelength entering our eyes is insufficient.

This is going to end up being an infinitely long HN discussion because it's 1) unsolvable without more data 2) infinitely interesting to any intellectual /shrug

pmarreck
0 replies
1h37m

If you look at the backgrounds of the list of people who have signed the "AI Doomer" manifesto (the one urging what I'd call an overly extreme level of caution), such as Geoffrey Hinton, Eliezer Yudkowsky, Elon Musk, etc., you will find that they're all rational materialists.

I don't think the correlation is accidental.

So you're on to something, here. And I've felt the exact same way as you, here. I'd love to see your blog post when it's done.

nkrisc
1 replies
8h12m

I couldn’t help noticing that all the AI Doomer folks are pure materialists who think that consciousness and will can be completely encoded in cause-and-effect atomic relationships. The real problem is that that belief is BS until proven true.

It’s no less BS than the other beliefs which can be summed up as “magic”.

pmarreck
0 replies
1h25m

It’s no less BS than the other beliefs which can be summed up as “magic”.

So basically I have to choose between a non-dualist pure-materialist worldview in which every single thing I care about, feel or experience is fundamentally a meaningless illusion (and to what end? why have a universe with increasing entropy except for life which takes this weird diversion, at least temporarily, into lower entropy?), which I'll sarcastically call the "gaslighting theory of existence", and a universe that might be "materialism PLUS <undiscovered elements>" which you arrogantly dismiss as "magic" by conveniently grouping it together with arguably-objectively-ridiculous arbitrary religious beliefs?

Sounds like a false-dichotomy fallacy to me

xpe
0 replies
6h59m

Many people disagree with LeCun for reasons having nothing to do with materialism. It is about logical reasoning over possible future scenarios.

ImHereToVote
0 replies
10h38m

It's a good thing our fate won't be sealed by a difference in metaphysical beliefs.

mjlbach
9 replies
17h19m

FYI dang they kinda ripped off our product down to copying the UI (Hedra.com). Our model is about 12x faster and supports 4 minute long videos…

shermantanktop
5 replies
16h9m

Fwiw, you’ve got one video on your homepage and everything else is locked behind a signup button.

I know that signup requirement is an article of faith amongst some startup types, but it's no surprise to me that shareable examples lead to sharing.

mjlbach
4 replies
15h43m

We have a sign-up because we ensure users accept our terms of service and acceptable use policy before creating their first video, which affirms they understand how their data is used (legally required in most US states) and will not use our technology to cause harm.

squarefoot
0 replies
4h55m

Do you realize that this or similar technology will eventually end up in every computer really soon? By building walls, you're essentially asking your potential users to go elsewhere. You should be as open as possible now, while there is still room and time for competition.

mhuffman
0 replies
10h12m

legally required in most US states

Funny how other sites can do this with a birthday dropdown, an IP address, and a checkbox.

We have a sign-up because we ensure users accept our terms of service and acceptable use policy before creating their first video

So your company would have no problem going on record saying that they will never email you for any reason, including marketing, and your email will never be shared or sold even in the event of a merger or acquisition? Because this is the problem people have with sign-up ... and the main reason most start-ups want it.

I am not necessarily for or against required sign-ups, but I do understand people that are adamantly against them.

d13
0 replies
9h59m

99% of visitors will just hit the back button.

KTibow
0 replies
14h27m

You can have that without a sign up.

nprateem
1 replies
7h27m

This is such a lame comment. It reflects very badly on your company.

bschmidt1
0 replies
5h27m

Especially considering how many people are attempting something similar - for example everyone copied ChatGPT's UI.

Will be funny/ironic when the first AI companies start suing each other for copyright infringement.

Personally for me the "3 column" UI isn't that good anyway, I would have gone with an "MMO Character Creation" type UX for this.

the__alchemist
0 replies
4h40m

This thread has opened my eyes to how many similar products exist, beyond your company's and OP's. Was yours the first? Could the other companies make the same claim about yours? Do you make the same claim about the others?

lcolucci
2 replies
1d

Love this one as well. It's a painting of Trithemius, a German monk, who actually said that

klipt
1 replies
1d

Although I assume he didn't say it in British English ;-)

lcolucci
0 replies
1d

No, probably not haha ;-)

naveensky
12 replies
1d

Is it similar to https://loopyavatar.github.io/? I was reading about this today and even the videos are exactly the same.

I am curious if you are in any way related to this team.

aaroninsf
5 replies
1d

Either this is the commercialization of the work of that project, by authors or collaborators,

or it appears to be a straight-up grift, wrapping someone else's work with a SPA website.

I don't see other possibilities.

sidneyprimas
1 replies
1d

We are not related to Loopy Avatar. We trained our own models. It's a coincidence that they launched yesterday.

In the AI/research community, people often try to use the same examples so that it's easier to compare performance across different models.

echelon
0 replies
1d

You should watch out for Hedra and Sync. Plus a bunch of Loopy activity on Discord.

zaptrem
0 replies
1d

I know these guys in real life, they've been working on this for months and, unlike the ByteDance paper, have actually shipped something you can try yourself.

robertlagrant
0 replies
6h10m

Not seeing other possibilities isn't great though, right? Clearly there are other possibilities.

lcolucci
3 replies
1d

No, not related. We just took some of Loopy's demo images + audios since they came out 2 days ago and people were aware of them. We want to do an explicit side-by-side at some point, but in the meantime people can make their own comparisons, i.e. compare how the two models perform on the same inputs.

Loopy is a Unet-based diffusion model, ours is a diffusion transformer. This is our own custom foundation model we've trained.

arcticfox
2 replies
1d

This took me a minute - your output demos are your own, but you included some of their inputs, to make for an easy comparison? Definitely thought you copied their outputs at first and was baffled.

sidneyprimas
0 replies
1d

Yes, exactly! We just wanted to make it easy to compare. We also used some inputs from other famous research papers for comparison (EMO and VASA). But all videos we show on our website/blog are our own. We don't host videos from any other model on our website.

Also, Loopy is not available yet (they just published the research paper). But you can try our model today, and see if it lives up to the examples : )

lcolucci
0 replies
1d

Exactly. Most talking avatar papers re-use each other's images + audios in their demo clips. It's just a thing everyone does... we never thought that people would think it means we didn't train our own model!

Whoever wants to can re-make all the videos themselves with our model by extracting the 1st frame and audio.

vunderba
0 replies
1d

It was posted to hacker news as well within the last day.

https://news.ycombinator.com/item?id=41463726

Examples are very impressive, here's hoping we get an implementation of it on huggingface soon so we can try it out, and even potentially self-host it later.

cchance
0 replies
1d

Holy shit loopy is good, i imagine another closed model, opensource never gets good shit like that :(

eth0up
0 replies
19h20m

I had difficulty getting my lemming to speak. After selecting several alternatives, I tried one with a more defined, open mouth, which required multiple attempts but mostly worked. Additional iterations on the same image can produce different results.

andrew-w
0 replies
21h8m

Cartoons are definitely a limitation of the current model.

lcolucci
0 replies
22h35m

I think you've made the 1st ever talking dog with our model! I didn't know it could do that

lcolucci
0 replies
23h23m

Nice! Earlier checkpoints of our model would "gender swap" when you had a female face and male voice (or vice versa). It's more robust to that now, which is good, but we still need to improve the identity preservation

layer8
0 replies
22h55m

The jaw is particularly unsettling somehow.

aramndrt
8 replies
1d1h

Quick tangent: Does anybody know why many new companies have this exact web design style? Is it some new UI framework or other recent tool? The design looks sleek, but they all appear so similar.

bearjaws
4 replies
1d1h

My sad millennial take is: we're in the brain rot era. If a piece of content doesn't have immediate animation / video and that "wowww" sound bite, nobody pays attention.

https://www.youtube.com/watch?v=Xp2ROiFUZ6w

stevenpetryk
3 replies
1d

My happy millennial take is that browsers have made strides in performance and flexibility, and people are utilizing that to build more complex and dynamic websites.

Simplicity and stillness can be beautiful, and so can animations. Enjoying smooth animations and colorful content isn’t brain rot imo.

whyslothslow
2 replies
1d

It may be unpopular, but my opinion is that web pages must not have non-consensual movement.

I’ll begrudgingly accept a default behavior of animations turned on, but I want the ability to stop them. I want to be able to look at something on a page without other parts of the page jumping around or changing form while I’m not giving the page any inputs.

For some of us, it’s downright exhausting to ignore all the motion and focus on the, you know, actual content. And I hate that this seems to be the standard for web pages these days.

I realize this isn’t particularly realistic or enforceable. But one can dream.

washadjeffmad
0 replies
6h11m

I've seen some site behaviors "rediscovered" lately that have both grated and tickled me because it's apparent the designers are too young to have been a part of the conversations from before the Web was Won.

They can't fathom what a world without near infinite bandwidth, low latency and load times, and disparate hardware and display capabilities with no graphical acceleration looks like, or why people wouldn't want video and audio to autoplay, or why we don't do flashing banners. They think they're distinguishing themselves using variations on a theme, wowing us with infinitely scrolling opuses when just leaving out the crap would do.

I still aim to make everything load within a single packet, and I'll happily maintain my minority position that that's the true pinnacle of web design.

mnahkies
0 replies
23h36m

For sites that have paid enough attention to accessibility, you might be able to configure your browser/OS such that this media query applies: https://developer.mozilla.org/en-US/docs/Web/CSS/@media/pref... - it's designed to encourage offering low-motion alternatives

sidneyprimas
0 replies
1d

It's much easier to use standard CSS packages, and these come with more standard styles. Our team doesn't have much experience building websites, so we just went with the standard styles. We used TailwindCSS.

ricardobeat
0 replies
23h11m

Designers today are largely driven by trends (just like engineering?). Being cool = jumping on the latest bandwagon, not being unique or better. The good news is this particular style is pretty much restricted to tech companies, I think it started with https://neon.tech a few years ago or a similar startup.

Incidentally, the same behaviour is seen in academia. These websites for papers are all copying this one from 2020: https://nerfies.github.io/

lcolucci
0 replies
1d

Do you mean on the infinity.ai site or studio.infinity.ai? On infinity.ai we just wanted something fast and easy. This is MagicUI

lcolucci
7 replies
23h45m

This is a bug in the model we're aware of but haven't been able to fix yet. It happens at the end of some videos but not all.

Our hypothesis is that the "breakdown" happens when there's a sudden change in audio levels (from audio to silence at the end). We extend the end of the audio clip and then cut it out of the video to try to handle this, but it's not working well enough.

drhodes
6 replies
22h49m

just an idea, but what if the appended audio clip was reversed to ensure continuity in the waveform? That is, if >< is the splice point and CLIP is the audio clip, then the idea would be to construct CLIP><PILC.
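(A minimal numpy sketch of the CLIP><PILC idea; the sample rate and pad length are illustrative, not anything from Infinity's actual pipeline:)

    import numpy as np

    def mirror_extend(clip, pad=None):
        """Append a time-reversed copy of the clip (or just its last `pad` samples),
        so the waveform is continuous at the splice point: CLIP >< PILC."""
        tail = clip if pad is None else clip[-pad:]
        return np.concatenate([clip, tail[::-1]])

    sr = 16000                                  # assumed sample rate
    audio = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)
    padded = mirror_extend(audio, pad=sr // 4)  # extend with a mirrored 0.25 s tail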

andrew-w
5 replies
22h47m

This is exactly what we do today! It seems to work better the more you extend it, but extending it too much introduces other side effects (e.g. the avatar will start to open its mouth, as if it were preparing to talk).

drhodes
4 replies
22h30m

Hmm, maybe adding white noise would work. -- OK, that's quite enough unsolicited suggestions from me up in the peanut gallery. Nice job on the website, it's impressive, thank you for not requiring a sign up.

andrew-w
3 replies
22h13m

All for suggestions! We've tried white noise as well, but it only works on plain talking samples (not music, for example). My guess is that the most robust solution will come from updating how it's trained.

bobbylarrybobby
2 replies
18h19m

What if you train it to hold the last frame on silence (or quiet noise)?

andrew-w
1 replies
17h55m

We've talked about doing something like that. Feels like it should work in theory.

jazzyjackson
0 replies
16h30m

Or noise corresponding with a closed mouth

Hmmmmmmmm

Ohmmmmmmm

zoogeny
6 replies
19h7m

I am actively working in this area from a wrapper application perspective. In general, tools that generate video are not sufficient on their own. They are likely to be used as part of some larger video-production workflow.

One drawback of tools like runway (and midjourney) is the lack of an API allowing integration into products. I would love to re-sell your service to my clients as part of a larger offering. Is this something you plan to offer?

The examples are very promising by the way.

andrew-w
5 replies
18h35m

I agree, I think power users are happy to go to specific platforms, but APIs open up more use cases that can reach a broader audience. What kind of application would you use it for? We don't have specific plans at the moment, but are gauging interest.

zoogeny
2 replies
18h7m

I'm looking to create an end-to-end story telling interface. I'm currently working on the MVP and my plan was just to generate the prompts and then require users to manually paste those prompts into the interfaces of products that don't support APIs and then re-upload the results. This is so far below ideal that I'm not sure it will sell at all. It is especially difficult if one tries to imagine a mobile client. Given the state of the industry it may be acceptable for a while, but ideally I can just charge some additional margin on top of existing services and package that as credits (monthly plan + extras).

Consider all of the assets someone would have to generate for a 1 minute video. Let's assume 12 clips of 5 seconds each. First they may have to generate a script (claude/openai). They will have to generate text audio and background/music audio (suno/udio). They probably have to generate the images (runway/midjourney/flux/etc) which they will feed into an img2vid product (infinity/runway/kling/etc). Then they need to do basic editing like trimming clip lengths. They may need to add text/captions and image overlays. Then they want to upload it to TikTok/YouTube/Instagram/etc (including all of the metadata for that). Then they will want to track performance, etc.

That is a lot of UI, workflows, etc. I don't think a company such as yours will want to provide all of that glue. And consumers are going to want choice (e.g. access to their favorite image gen, their favorite text-to-speech).

Happy to talk more if you are interested. I'm at the prototype stage currently. As an example, consider the next logical step for an app like https://autoshorts.ai/

andrew-w
0 replies
18h2m

Makes sense, thank you!

35mm
0 replies
3h57m

I am doing this in a semi-automated way right now based on a voiceover of me speaking.

It would be very useful to have API access to Infinity to automate the creation of a talking head avatar.

bhanu423
1 replies
18h6m

Hopping onto the original comment - I am building a video creation platform focused on providing accessible education to the masses in developing countries. Would love to integrate something like this into our platform. Would love to pay for API access, and so would so many others. Please consider opening an API; you would make a lot of money right now, which can be used for your future plans.

andrew-w
0 replies
18h0m

Cool use case! Thanks for sharing your thoughts.

mjlbach
5 replies
17h15m

Lina, I welcome competition but I can’t support this. This is a blatant copy of our UI (hedra.com) and you are advertising with celebrity deepfakes.

I strongly suggest you follow our example and roll out biometric privacy notices and block celebrity content.

zaptrem
0 replies
13h52m

Neither your model nor theirs is remotely close to actually fooling anyone, so celebs could only be used for (very funny) obvious satire. I see no risk of harm here.

Also, two boxes for uploading the only two inputs to a model is not a new idea. One could say you stole it from Gradio (but even that's silly).

supermatt
0 replies
10h22m

While I agree there are potential issues with using celebrity images, their UI is effectively no different to any of the 326432+ examples of handling model input on huggingface spaces.

saberience
0 replies
11h11m

There's nothing particularly original about the UI, it's literally just a basic image upload and sound upload. I can easily see every hyperscaler AI firm offering something similar within one year so no need to get on your high horse about this.

rerdavies
0 replies
11h8m

@mjlbach: Weirdly broken web page. You can't even see samples without creating an account.

batch12
0 replies
15h23m

What makes celebrity deepfakes worse than pleb deepfakes?

jl6
5 replies
22h59m

Say I’m a politician who gets caught on camera doing or saying something shady. Will your service do anything to prevent me from claiming the incriminating video was just faked using your technology? Maybe logging perceptual hashes of every output could prove that a video didn’t come from you?

bee_rider
3 replies
22h37m

These sort of models are probably going to end up released as publicly available weights at some point, right? Or, if it can be trained for $500k today, how much will it cost in a couple years? IMO we can’t stuff this genie back in the bottle, for better or worse. A video won’t be solid evidence of much within our lifetimes.

sidneyprimas
1 replies
21h38m

That's how I see it as well. Very soon, people will assume most videos are AI generated, and the burden of proof will be on people claiming videos are real. We plan to embed some kind of hash to indicate our video is AI generated, but people will be able to get around this. Google/Apple/Samsung seem to be in the best place to solve this: whenever their devices record a real video, they can generate a hash directly in HW for that video, which can be used to verify that it was actually recorded by that phone.

Also, I think it will cost around $100k to train a model at this quality level within 1-2 years. And it will only go down from there. So, the genie is out of the bottle.
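(A rough stdlib sketch of that verification idea. A real device would use an asymmetric signature from a key held in secure hardware, e.g. C2PA-style provenance; the shared HMAC key below is only a stand-in:)

    import hashlib, hmac

    DEVICE_KEY = b"stand-in for a key kept in the phone's secure element"  # assumption

    def sign_recording(video_bytes):
        """Hash the captured video and tag it with the device key at record time."""
        digest = hashlib.sha256(video_bytes).digest()
        return hmac.new(DEVICE_KEY, digest, hashlib.sha256).hexdigest()

    def verify_recording(video_bytes, tag):
        """Anyone trusting the device key can later check the clip is unmodified."""
        return hmac.compare_digest(sign_recording(video_bytes), tag)

    clip = b"...raw video bytes..."
    tag = sign_recording(clip)
    print(verify_recording(clip, tag))         # True: untouched clip verifies
    print(verify_recording(clip + b"x", tag))  # False: any edit breaks the tag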

bee_rider
0 replies
21h11m

That makes sense. It isn’t reasonable to expect malicious users to helpfully set the “evil bit,” but you can at least add a little speedbump by hashing your own AI generated content (and the presence of videos that are verifiably AI generated will at least probably catch some particularly lazy/incompetent bad actors, which will destroy their credibility and also be really funny).

In the end though, the incentive and the capability lies in the hands of camera manufacturers. It is unfortunate that video from the pre-AI era have no real reason to have been made verifiable…

Anyway, recordings of politicians saying some pretty heinous things haven’t derailed some of their campaigns anyway, so maybe none of this is really worth worrying about in the first place.

sidneyprimas
0 replies
21h36m

Ya, it's only a matter of time until very high quality video models will be open sourced.

chipsrafferty
0 replies
14h14m

I think you're fine because these videos don't look or sound the least bit realistic

sroussey
4 replies
1d1h

I look forward to movies that are dubbed with the face and lips moved to match the dubbed dialogue. Also using the original actor's voice.

schrijver
0 replies
23h51m

I thought the larger public was starting to accept subtitles so I was hoping we’d rather see the end of dubbed movies !

lcolucci
0 replies
1d1h

agreed!

foreigner
0 replies
1d

Wow that would be very cool.

SwiftyBug
0 replies
1d

+1 for the lips matching the dubbed speech, but I'm not sure about cloning the actor's voice. I really like dubbing actor's unique voices and how they become the voice of some characters in their language.

naveensky
4 replies
1d

Is there any limitation on the video length?

lcolucci
3 replies
1d

Our transformer model was trained to generate videos that are up to 8s in length. However, we can make videos that are longer by using it in an autoregressive manner, taking the last N frames of output i to seed output (i+1). It is important to use more than just 1 frame. Otherwise, the direction of movement can suddenly change, which looks very uncanny. Admittedly, the autoregressive approach tends to accumulate errors with each generation.

It is also possible to fine-tune the model so that single generations (one forward pass of the model) are longer than 8s, and we plan to do this. In practice, it just means our batch sizes have to be smaller when training.

Right now, we've limited the public tool to only allow videos up to 30s in length, if that is what you were asking.
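(A minimal sketch of that seeding scheme, with the model call stubbed out; frame counts, shapes, and names are illustrative only, not Infinity's code:)

    import numpy as np

    def generate_chunk(seed_frames, audio_chunk, n_frames=192):
        """Placeholder for one forward pass (~8 s at 24 fps); here it just returns
        noise shaped like video frames instead of running a real model."""
        h, w, c = seed_frames.shape[1:]
        return np.random.rand(n_frames, h, w, c).astype(np.float32)

    def generate_long_video(first_frame, audio_chunks, n_seed=8):
        """Seed pass i+1 with the last n_seed frames of pass i. Several seed frames
        (not just one) keep the motion direction consistent across the boundary,
        but errors still accumulate with each recursion."""
        seed = np.repeat(first_frame[None], n_seed, axis=0)
        chunks = []
        for audio_chunk in audio_chunks:
            chunk = generate_chunk(seed, audio_chunk)
            chunks.append(chunk)
            seed = chunk[-n_seed:]
        return np.concatenate(chunks)

    video = generate_long_video(np.zeros((320, 320, 3), np.float32),
                                audio_chunks=[None, None, None])  # ~24 s of frames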

leobg
1 replies
23h7m

Video compression algorithms use key frames. So can’t you do the same thing? Essentially, generate five seconds. Then pull out the last frame. Use some other AI model to enhance it (upscale, consistency with the original character, etc.). Then use that as the input for the next five seconds?

andrew-w
0 replies
22h56m

This is a good idea. We have discussed incorporating an additional "identity" signal to the conditioning, but simply enforcing consistency with the original character as a post-processing step would be a lot easier to try. Are there any tools you know of that do that?

naveensky
0 replies
1d

Thanks for answering this. I would love to use it when APIs are available to integrate with my apps

dorianmariefr
4 replies
1d

quite slow btw

andrew-w
3 replies
1d

Yeah, it's about 5x slower than realtime with the current configuration. The good news is that diffusion models and transformers are constantly benefitting from new acceleration techniques. This was a big reason we wanted to take a bet on those architectures.

Edit: If we generate videos at a lower resolution and with a fewer number of diffusion steps compared to what's used in the public configuration, we are able to generate videos at 20-23 fps, which is just about real-time. Here is an example: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/fast...

ilaksh
1 replies
22h23m

Wowww.. can you buy more hardware and make a realtime websocket API?

andrew-w
0 replies
19h26m

It's something we're thinking about. Our main focus right now is to make the model as good as it can be. There are still many edge cases and failure modes.

lcolucci
0 replies
23h3m

Woah that's a good find Andrew! That low-res video looks pretty good

bschmidt1
4 replies
23h9m

Amazing work! This technology is only going to improve. Soon there will be an infinite library of rich and dynamic games, films, podcasts, etc. - a totally unique and fascinating experience tailored to you that's only a prompt away.

I've been working on something adjacent to this concept with Ragdoll (https://github.com/bennyschmidt/ragdoll-studio), but focused not just on creating characters but producing creative deliverables using them.

lcolucci
3 replies
22h11m

Very cool! If we release an API, you could use it across the different Ragdoll experiences you're creating. I agree personalized character experiences are going to be a huge thing. FYI we plan to allow users to save their own characters (an image + voice combo) soon

bschmidt1
2 replies
20h3m

If we release an API, you could use it

Absolutely, especially if the pricing makes sense! Would be very nice to just focus on the creative suite which is the real product, and less on the AI infra of hosting models, vector dbs, and paying for GPU.

Curious if you're using providers for models or self-hosting?

andrew-w
1 replies
19h24m

We use Modal for cloud compute and autoscaling. The model is our own.

bschmidt1
0 replies
19h4m

Amazing, great to hear it :)

LarsDu88
4 replies
1d1h

Putting Drake as a default avatar is just begging to be sued. Please remove pictures of actual people!

sidneyprimas
1 replies
1d

Ya, this is tricky. Our stance is that people should be able to make funny, parody videos with famous people.

thih9
0 replies
5h15m

Is that legal? As in: can you use an image of a celebrity without their consent as part of the product demo?

stevenpetryk
0 replies
1d

That would be ironic given how Drake famously performed alongside an AI recreation of Pac.

bongodongobob
0 replies
1d

Sounds like free publicity to me.

Andrew_nenakhov
4 replies
1d

I wonder how long it would take for this technology to advance to a point where nice people from /r/freefolk would be able to remake seasons 7 and 8 of Game of Thrones to have a nice proper ending? 5 years, 10?

squarefoot
1 replies
21h40m

In a few years we'll have entire shows made exclusively by AI.

DistractionRect
0 replies
21m

On one hand... But on the other, there are so many shows that got canceled or just got a really shitty ending that could be rewritten. Kinda looking forward to it.

lcolucci
0 replies
23h16m

I'd say the 5 year ballpark is about right, but it'll involve combining a bunch of different models and tools together. I follow a lot of great AI filmmakers on Twitter. They typically make ~1min long videos using 3-8 different tools... but even those 1min videos were not possible 9 months ago! Things are moving fast

andrew-w
0 replies
23h7m

Haha, wouldn't we all love that? In the long run, we will definitely need to move beyond talking heads, and have tools that can generate full actors that are just as expressive. We are optimistic that the approach used in our V2 model will be able to get there with enough compute.

yellowapple
3 replies
21h50m

As soon as I saw the "Gnome" face option I gnew exactly what I gneeded to do: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

EDIT: looks like the model doesn't like Duke Nukem: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Cropping out his pistol only made it worse lol: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

A different image works a little bit better, though: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

andrew-w
0 replies
21h39m

This is why we do what we do lol

ainiriand
0 replies
21h9m

Haha, I almost woke up my kid with my sudden laugh!

ianbicking
3 replies
1d

The actor list you have is so... cringe. I don't know what it is about AI startups that they seem to be pulled towards this kind of low brow overly online set of personalities.

I get the benefit of using celebrities because it's possible to tell if you actually hit the mark, whereas if you pick some random person you can't know if it's correct or even stable. But jeez... Andrew Tate in the first row? And it doesn't get better as I scroll down...

I noticed lots of small clips so I tried a longer script, and it seems to reset the scene periodically (every 7ish seconds). It seems hard to do anything serious with only small clips...?

sidneyprimas
2 replies
1d

Thanks for the feedback! The good news is that the new V2 model will allow people to create their own actors very easily, and so we won't be restricted to the list. You can try that model out here: https://studio.infinity.ai/

The rest of our website still uses the V1 model. For the V1 model, we had to explicitly onboard actors (by fine-tuning our model for each new actor). So, the V1 actor list was just made based on what users were asking for. If enough users asked for an actor, then we would fine-tune a model for that actor.

And yes, the 7s limit on v1 is also a problem. V2 right now allows for 30s, and will soon allow for over a minute.

Once V2 is done training, we will get it fully integrated into the website. This is a pre-release.

ianbicking
1 replies
23h33m

Ah, I didn't realize I had happened upon a different model. Your actor list in the new model is much more reasonable.

I do hope more AI startups recognize that they are projecting an aesthetic whether they want to or not, and try to avoid the middle school boy or edgelord aesthetic, even if that makes up your first users.

Anyway, looking at V2 and seeing the female statue makes me think about what it would be like to take all the dialog from Galatea (https://ifdb.org/viewgame?id=urxrv27t7qtu52lb) and putting it through this. [time passes :)...] trying what I think is the actual statue from the story is not a great fit, it feels too worn by time (https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...). But with another statue I get something much better: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

One issue I notice in that last clip, and some other clips, is the abrupt ending... it feels like it's supposed to keep going. I don't know if that's an artifact of the input audio or what. But I would really like it if it returned to a kind of resting position, instead of the sense that it will keep going but that the clip was cut off.

On a positive note, I really like the Failure Modes section in your launch page. Knowing where the boundaries are gives a much better sense of what it can actually do.

andrew-w
0 replies
22h45m

Very creative use cases!

We are trying to better understand the model behavior at the very end of the video. We currently extend the audio a bit to mitigate other end-of-video artifacts (https://news.ycombinator.com/item?id=41468520), but this can sometimes cause uncanny behavior similar to what you are seeing.

cchance
3 replies
1d

I tried with the Drake avatar, with Drake saying some stuff, and while it's cool, it's still lacking, like his teeth are disappearing partially :S

sidneyprimas
1 replies
1d

Agreed! The teeth can be problematic. The good news is we just need to train at higher resolution (right now we are at 320x320px), and that should resolve the teeth issue.

So far, we have purposely trained on low resolution to make sure we get the gross expressions / movements right. The final stage of training will be using higher resolution training data. Fingers crossed.

gessha
0 replies
15h26m

Realistic teeth in lipsync videos based purely on data and without explicit priors would be tough.

Good luck :)

andrew-w
0 replies
1d

Thanks for the feedback. The current model was trained at ~320x320 resolution. We believe going higher will result in better videos with finer detail, which we plan to do soon.

bufferoverflow
3 replies
20h28m

It completely falls apart on longer videos for me, unusable over 10 seconds.

mjlbach
1 replies
17h47m

You can try us (Hedra.com) we support 4 minute videos and have been out a couple months now.

bufferoverflow
0 replies
15h39m

Oh wow. Much slower, but much higher quality. Which I definitely prefer.

Thank you!

lcolucci
0 replies
19h33m

This is a good observation. Can you share the videos you’re seeing this with? For me, normal talking tends to work well even on long generations. But singing or expressive audio starts to devolve with more recursions (1 forward pass = 8 sec). We’re working on this.

lcolucci
0 replies
23h39m

that's a great one!

knodi123
0 replies
21h30m

So if we add autotune....

kelseyfrog
0 replies
23h5m

Big Dracula Flow energy which is not bad :)

lcolucci
0 replies
1d

Nice :) It's been really cool to see the model get more and more expressive over time

andrew-w
0 replies
22h19m

I don't think we've seen laughing quite that expressive before. Good find!

vessenes
2 replies
21h48m

Hi Lina, Andrew and Sidney, this is awesome.

My go-to for checking the edges of video and face identification LLMs are Personas right now -- they're rendered faces done in a painterly style, and can be really hard to parse.

Here's some output: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Source image from: https://personacollective.ai/persona/1610

Overall, crazy impressive compared to competing offerings. I don't know if the mouth size problems are related to the race of the portrait, the style, the model, or the positioning of the head, but I'm looking forward to further iterations of the model. This is already good enough for a bunch of creative work, which is rad.

lcolucci
1 replies
21h37m

I didn't know about Persona Collective - very cool!

I think the issues in your video are more related to the style of the image and the fact that she's looking sideways than the race. In our testing so far, it's done a pretty good job across races. The stylized painting aesthetic is one of the harder styles for the model to do well on. I would recommend trying with a straight on portrait (rather than profile) and shorter generations as well... it might do a bit better there.

Our model will also get better over time, but I'm glad it can already be useful to you!

vessenes
0 replies
21h8m

It's not portrait orientation or gender specific or length related: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

It's not stylization (alone): here's a short video using the same head proportions as the original video, but the photo style is a realistic portrait. I'd say the mouth is still overly wide. https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

I tentatively think it might be race related -- this is one done of a different race. Her mouth might also be too wide? But it stands out a bit less to me. https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

p.s. happy to post to a bug tracker / github / whatever if you prefer. I'm also happy to license over the Persona Collective images if you want to pull them in for training / testing -- feel free to email me. There's a move away from 'painterly' style support in the current crop of diffusion models (flux for instance absolutely CANNOT do painting styles), and I think that's a shame.

Anyway, thanks! I really like this.

Sakos
0 replies
8h48m

It's like I'm watching him on the news

ASalazarMX
0 replies
19m

Seems like some longer videos gradually slip into the uncanny valley.

lcolucci
0 replies
22h37m

Wow this worked so well! Sometimes with long hair and paintings, it separates part of the hair from the head but not here

andrew-w
0 replies
22h38m

Thank you! It has learned a surprising amount of world knowledge.

strogonoff
2 replies
16h28m

Is it fairly trained?

b0ner_t0ner
1 replies
15h58m

You think Kanye approved this?

strogonoff
0 replies
3h15m

You think every musician personally approves every use of their work?

nickfromseattle
2 replies
1d

I need to create a bunch of 5-7 minute talking head videos. What's your timeline for capabilities that would help with this?

lcolucci
0 replies
23h40m

Our model can recursively extend video clips, so theoretically we could generate your 5-7min talking head videos today. In practice, however, error accumulates with each recursion and the video quality gets worse and worse over time. This is why we've currently limited generations to 30s.

We're actively working on improving stability and will hopefully increase the generation length soon.

btbuildem
0 replies
4h38m

Could you not do that today, with the judicious use of cuts and transitions?

naveensky
2 replies
1d

For such models, is it possible to fine-tune models with multiple images of the main actor?

Sorry, if this question sounds dumb, but I am comparing it with regular image models, where the more images you have, the better output images you generate for the model.

andrew-w
1 replies
1d

It is possible to fine-tune the model with videos of a specific actor, but not images. You need videos to train the model.

We actually did this in early overfitting experiments (to confirm our code worked!), and it worked surprisingly well. This is exciting to us, because it means we can have actor-specific models that learn the idiosyncratic gestures of a particular person.

naveensky
0 replies
11h56m

This is actually great, waiting for your API integration or replicate integration to get my hands dirty :)

max4c
2 replies
22h26m

This is amazing and another moment where I question what the future of humans will look like. So much potential for good and evil! It's insane.

lcolucci
0 replies
21h12m

thank you! it's for sure an interesting time to be alive... can't complain about it being boring

jaysonelliot
0 replies
16h15m

And it seems that absolutely no one involved is concerned with the potential uses for evil, so long as they're in line to make a couple dollars.

ladidahh
2 replies
1d

I have uploaded an image and then used text to image, and both videos were not animated but the audio was included

lcolucci
0 replies
1d

can you clarify? what image did you use? or send the link to the resulting video

andrew-w
0 replies
1d

This can happen with non-humanoid images. The model doesn't know how to animate them.

ilaksh
2 replies
22h43m

It would be amazing to be able to drive this with an API.

sidneyprimas
1 replies
22h37m

We are considering it. Do you have anything specific you want to use it for?

ilaksh
0 replies
22h29m

Basically as a more engaging alternative to Eleven Labs or other TTS.

I am working on my latest agent (and character) framework and I just started adding TTS (currently with the TTS library and xtts_v2 which I think is maybe also called Style TTS.) By the way, any idea what the license situation is with that?

Since it's driven by audio, I guess it would come after the TTS.

andrew-w
1 replies
19h39m

I know what will be in my nightmares tonight...

eth0up
0 replies
19h26m

One person's nightmare is another's sweet dream. I, for one.. and all that.

passion__desire
0 replies
3h2m

Spotify has launched a TikTok-like feature where the best music snippets of a track are recommended in the feed. Imagine AI-generated art videos plus faces lip-syncing the lyrics forming the video portion of those tracks in the feed.

jagged-chisel
0 replies
6h15m

Would be more impressive with something closer to Steve’s voice

advael
2 replies
17h31m

There is prior art here, e.g. Emo from alibaba research (https://humanaigc.github.io/emote-portrait-alive/), but this is impressive and also actually has a demo people can try, so that's awesome and great work!

wseqyrku
0 replies
2h46m

Was just about to post this, I'm yet to see a model beating that in terms of realistic quality

lcolucci
0 replies
16h27m

Yep for sure! EMO is a good one. VASA-1 (Microsoft) and Loopy Avatar (ByteDance) are two others from this year. And thanks!

WaffleIronMaker
2 replies
20h53m

Does anybody know about the legality of using Eminem's "Godzilla" as promotional material[1] for this service?

I thought you had to pay artists for a license before using their work in promotional material.

[1] https://infinity.ai/videos/setA_video3.mp4

tiahura
0 replies
15h39m

Parody is fair use.

RobinL
2 replies
1d

Have to say, whilst this tech has some creepy aspects, just playing about with this my family have had a whole sequence of laugh-out-loud moments - thank you!

sidneyprimas
0 replies
22h33m

This makes me so happy. Thanks for reporting back! Goal is to reduce creepiness over time.

lcolucci
0 replies
23h50m

I'm so glad! We're trying to increase the laugh out loud moments in the world :)

DevX101
2 replies
1d1h

Any details yet on pricing or too early?

lcolucci
1 replies
1d1h

It's free right now, and we'll try to keep it that way as long as possible

latentsea
0 replies
16h18m

What about open weights?

If not now, would you consider doing that with older versions of the model as you make better ones?

andrew-w
0 replies
19h33m

I see a lot of potential in animating memes and making them more fun to share with friends. Hopefully, we can do better on orcs soon!

w10-1
1 replies
1d

Breathtaking!

First, your (Lina's) intro is perfect in honestly and briefly explaining your work in progress.

Second, the example I tried had a perfect interpretation of the text meaning/sentiment and translated that to vocal and facial emphasis.

It's possible I hit on a pre-trained sentence. With the default manly-man I used the phrase, "Now is the time for all good men to come to the aid of their country."

Third, this is a fantastic niche opportunity - a billion+ memes a year - where each variant could require coming back to you.

Do you have plans to be able to start with an existing one and make variants of it? Is the model such that your service could store the model state for users to work from if they e.g., needed to localize the same phrase or render the same expressivity on different facial phenotypes?

I can also imagine your building different models for niches: faces speaking, faces aging (forward and back); outside of humans: cartoon transformers, cartoon pratfalls.

Finally, I can see both B2C and B2B, and growth/exit strategies for both.

lcolucci
0 replies
1d

Thank you! You captured the things we're excited about really well. And I'm glad your video was good! Honestly, I'd be surprised if that sentence was in the training data... but that default guy tends to always look good.

Yes, we plan on allowing people to store their generations, make variations, mix-and-match faces with audios, etc. We have more of an editor-like experience (script-to-video) in the rest of our web app but haven't had time to move the new V2 model there yet. Soon!

toisanji
1 replies
19h34m

can we choose our own voices?

andrew-w
0 replies
19h6m

The web app does allow you to upload any audio, but in order to use your voice, you would need to either record a sample for each video or clone your voice with a 3rd party TTS provider. We would like to make it easier to do all that within our site - hopefully soon!

sharemywin
1 replies
1d

You need a slider for how animated the facial expressions are.

lcolucci
0 replies
23h59m

That's a good idea! CFG is roughly correlated with expressiveness, so we might expose that to the user at some point

modeless
1 replies
22h11m

Won't be long before it's real time. The first company to launch video calling with good AI avatars is going to take off.

johnyzee
1 replies
1d

It's incredibly good - bravo. Only thing missing for this to be immediately useful for content creation, is more variety in voices, or ideally somehow specifying a template sound clip to imitate.

andrew-w
0 replies
23h56m

Thanks for the feedback! We used to have more voices, but didn't love the experience, since users had no way of knowing what each voice sounded like without creating a clip themselves. Probably having pre-generated samples for each one would solve that. Let us know if you have any other ideas.

We're also very excited about the template idea! Would love to add that soon.

jadbox
1 replies
18h46m

Awesome, any plans for an API and, if so, how soon?

andrew-w
0 replies
18h38m

No plans at the moment, but there seems to be a decent amount of interest here. Our main focus has been making the model as good as it can be, since there are still many failure modes. What kind of application would you use it for?

andrew-w
0 replies
19h50m

No, it's just a hallucination of the model. The audio in your clip is synthetic and doesn't reflect any video in the real world.

Hopefully we can animate your bear cartoon one day!

doctorpangloss
1 replies
19h36m

If you had a $500k training budget, why not buy 2 DGX machines?

andrew-w
0 replies
19h2m

To be honest, one of our main goals as a startup is to move quickly, and using readily accessible cloud providers for training makes that much easier.

dhbradshaw
1 replies
18h21m

So good it feels like I think maybe I can read their lips

lcolucci
0 replies
18h2m

This is the best compliment :) and also a good idea… could a trained lip reader understand what the videos are saying? Good benchmark!

deisteve
1 replies
22h55m

what is the TTS model you are using

lcolucci
0 replies
22h30m

We use more than one but ElevenLabs is a major one. The voice names in the dropdown menu ("Amelia", "George", etc) come from ElevenLabs

billconan
1 replies
23h54m

can this achieve real-time performance or how far are we from a real-time model?

andrew-w
0 replies
23h29m

The model configuration that is publicly available is about 5x slower than real-time (~6fps). At lower resolution and with a less conservative number of diffusion steps, we are able to generate the video at 20-23 fps, which is just about real-time. Here is an example: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/fast...

We use rectified flow for denoising, which is a (relatively) recent advancement in diffusion models that allows them to run a lot faster. We also use a 3D VAE that compresses the video along both spatial and temporal dimensions. Temporal compression also improves speed.
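(A toy sketch of why rectified flow tolerates so few denoising steps: sampling is just a coarse Euler integration of a learned velocity field along nearly straight noise-to-data paths. The velocity function below is a stand-in, not the real model:)

    import numpy as np

    def velocity(x, t):
        """Stand-in for the learned field v(x, t) ~ E[noise - data | x_t].
        Pretend the data is all zeros, so x_t = t * noise and v = x / t."""
        return x / max(t, 1e-6)

    def rectified_flow_sample(shape, n_steps=8, seed=0):
        """Euler-integrate from t=1 (pure noise) down to t=0 (data). Because the
        paths are nearly straight, a handful of steps is enough, which is what
        makes cutting diffusion steps (and near real-time video) workable."""
        rng = np.random.default_rng(seed)
        x = rng.standard_normal(shape)
        dt = 1.0 / n_steps
        for i in range(n_steps):
            t = 1.0 - i * dt
            x = x - velocity(x, t) * dt  # one straight-line step toward the data
        return x

    latent = rectified_flow_sample((40, 40, 16))  # e.g. a 3D-VAE-compressed frame latent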

ardrak
1 replies
17h29m

It often inserts hands into the frame.

Looks like too much Italian training data

lcolucci
0 replies
17h13m

this made me laugh out loud

archon1410
1 replies
22h37m

The website is pretty lightweight and easy to use. The service also holds up pretty well, especially if the source image is high-enough resolution. The tendency to "break" at the last frame seems to happen with low resolution images.

My generation: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

lcolucci
0 replies
22h33m

Thank you! It's interesting you've noticed the last frame breakdown happening more with low-res images. This is a good hypothesis that we should look into. We've been trying to debug that issue

SlackingOff123
1 replies
17h38m

Oh, this is amazing! I've been having so much fun with it.

One small issue I've encountered is that sometimes images remain completely static. Seems to happen when the audio is short - 3 to 5 seconds long.

sidneyprimas
0 replies
16h36m

Can you share an example of this happening? I am curious. We can get static videos if our model doesn't recognize it as a face (e.g. an Apple with a face, or sketches). Here is an example: https://toinfinityai.github.io/v2-launch-page/static/videos/...

I would be curious if you are getting this with more normal images.

PerilousD
1 replies
1d

Damn - I took an (AI) image that I "created" a year ago that I liked, and then you animated it AND let it sing Amazing Grace. Seeing IS believing, so this technology pretty much means video evidence ain't necessarily so.

lcolucci
0 replies
1d

We're definitely moving into a world where seeing is no longer believing

whitehexagon
0 replies
8h36m

Thank you for no signup, it's very impressive, especially the physics of the head movement relating to vocal intonation.

I feel like I accidentally made an advert for whitening toothpaste:

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

I am sure the service will get abused, but wish you lots of success.

snickmy
0 replies
10h17m

Out of curiosity, where are you training all this? AKA, where do you find the money to support such training?

slt2021
0 replies
23h55m

great job Andrew and Sidney!

siscia
0 replies
6h20m

Can I get a pricing quote?

sharemywin
0 replies
1d

accidentally clicked the generate button twice.

scotty79
0 replies
7h44m

It's awesome for very short texts, like a single long sentence. For even slightly longer sequences it seems to lose adherence to the initial photo and also ventures into the uncanny valley with exaggerated facial expressions.

A product built on top of this could split the input into reasonable chunks, generate video for each of them separately, and stitch them with another model that can transition from one facial expression into another in a fraction of a second.

An additional improvement might be feeding the system not with one image but with a few, expressing different emotional expressions. Then the system could analyze the split input to find out which emotional state each part of the video should start in.

On an unrelated note ... generated expressions seem to be relevant to the content of the input text. So either the text-to-speech understands the language a bit, or the video model itself does.

protocolture
0 replies
7h49m

Sadly it wouldn't animate an image of SHODAN from System Shock 2

lofaszvanitt
0 replies
1d

Rudimentary, but promising.

kemmishtree
0 replies
12h6m

I'd love to enable Keltar, the green guy in the ceramic cup, to do this www.molecularReality/QuestionDesk

fsndz
0 replies
13h54m

Super nice. Why does it degrade the quality of the image so much? It makes it look obviously AI-generated rapidly.

dvfjsdhgfv
0 replies
8h30m

Hi, there is a mistake in the headline, you wrote "realistic".

bosky101
0 replies
2h22m

Dayum

barrenko
0 replies
8h38m

Talking pictures. Talking heads!

atum47
0 replies
22h35m

This is super funny.

android521
0 replies
10h8m

This is great. Is it open source? Is there an API, and what is the pricing?

Log_out_
0 replies
2h13m

and now a word from our..