VASA-1: Lifelike audio-driven talking faces generated in real time

TriangleEdge
17 replies
1d

Why is this research being done? Is this some kind of arms race? The only purpose of this technology I can think of is getting spies to abuse others.

Am I going to have to do AuthN and AuthZ on every phone call and zoom now?

Arnavion
9 replies
1d

On the other hand, if deepfaking becomes common enough that everyone stops trusting everything they read / see on the internet, it would be a net good against the spread of disinformation compared to today.

hiatus
3 replies
1d

I don't see that as an outcome. We have already seen a grand erosion of trust in institutions. Moving to an even lower trust society does not sound like it would have positive consequences for discourse, public policy, or society at large.

throwthrowuknow
1 reply
5h29m

The benefit is that you can only trust in person interaction with social and governmental institutions so people will have to leave their damn house again and go talk to each other face to face. Too many of our current problems are caused by people only interacting with each other and the world through third parties who are performing a MITM operation for their own benefit.

1attice
0 replies
2h41m

This assumes that it's a two-way door.

Over the past century and a half, we've moved into vast, anonymous spaces, where I'm as likely to know and get along with my neighbour as I am to win the lottery.

And this is important. No, it's not just a matter of putting on an effort to learn who my neighbour is -- my neighbour is literally someone whose life experiences are wildly different, whose social outcomes will be wildly different, whose beliefs and values are wildly different, and, for all I know, goes to conferences about how to eliminate me and my kind.

(This last part is not speculation; I'm trans; see: CPAC)

And these are my reasons. My neighbour is probably equivalently terrified of me, or what I represent, or the media I consume, or the conferences that I go to.

Generalizing, you can't take a bunch of random people whose only bond is that they share meatspace-proximity, draw a circle around them, and declare them a community; those communities are _gone_, and you can no more bring them back than you can revive a corpse. (This would also probably not be a good idea, even if it were possible: they were also incredibly uncomfortable places for anyone who didn't fit in, and we have generations of fiction about people risking everything to leave for those big anonymous cities we created in step 1.)

So, here we are, dependent on technology to stay in touch with far-flung friends and lovers and family, all of us, scattered like spiderwebs across the globe, and now into the strands drips a poison.

Daniel Dennett was right. Counterfeit people are an enormous danger to civilization. Research like this should stop immediately.

rightbyte
0 replies
23h49m

Ironically, low-effort deepfakes might increase trust in organizations that have had the budget to fake stuff since their inception. The losers are 'citizen journalists' broadcasting on YouTube etc.

notaustinpowers
1 reply
23h47m

I don't see the extinction of trust through the introduction of garbage falsehoods to be a net good.

Believing that everything you eat is poisoned is no way to live. Believing that everything you see is a lie is also no way to live.

throwthrowuknow
0 replies
5h25m

Before photography, this was just the normal state of the world. Think a little: back then, any story or picture you saw was made by a person, and you only had their reputation to go by. Think some more and you realize that's never changed, even with pictures and video. Easy AI-generated pictures and video just remove the illusion of trust.

anigbrowl
1 reply
21h18m

> everyone stops trusting everything

Why would you expect this to happen? Lots of people are gullible; if it were otherwise, a lot of well-known politicians would be out of a job or would never have been elected to begin with.

ryandrake
0 replies
20h52m

If it's even more common than "common enough", then anyone could at least try to help their gullible friends and family by sending them a deepfake video of them doing or saying something they never did or said. A lot of people will suddenly wise up when a problem affects them directly.

piva00
0 replies
19h36m

That's the whole issue, though: the spread of disinformation eroded trust, and furthering this into the obliteration of all trust is not a good outcome.

andybak
1 reply
23h51m

Because the tech for this is only a slight variation of the tech for a broad range of legitimate applications?

Because even this precise tech has legitimate use cases?

> The only purpose of this technology I can think of is getting spies to abuse others.

Can you really not think of any other use cases?

krainboltgreene
0 replies
17h19m

Why don't you list some legitimate and useful applications of this work? Especially at the price we and this company are paying.

tithe
0 replies
1d

I get the feeling it's "someone's going to do this, so it might as well be us."

It's fascinating how research can take on a life of its own and will be pushed, by someone, to its own conclusion. Even for immensely destructive technologies (e.g., atomic weapons, viruses), the impact of a technology is its own attractor (could you say that's risk-seeking behavior?).

> Am I going to have to do AuthN and AuthZ on every phone call and zoom now?

"Alexa, I need an alibi for yesterday at noon."

phkahler
0 replies
5h10m

Newscasters and other talking heads will be out of business. Just pipe the script into some AI and get video.

danmur
0 replies
5h37m

We all know why this is really happening. Clippy 2.0.

HarHarVeryFunny
0 replies
2h20m

> Why is this research being done?

I think it's mostly "because it can be done". These types of impressive demos have become relatively low hanging fruit in terms of how modern machine learning can be applied.

One could imagine commercial applications (VR, virtual "try before you buy", etc), but things like this can also be a flex by the AI labs, or a PhD student wanting to write a paper.

1659447091
0 replies
21h37m

Advertising. Now you and your friends star in the streaming commercials and digital billboards near you! (whether you want to or not)

balls187
14 replies
1d

I'm curious what the reason for deepfake research is, and what the practical applications are.

Can someone explain the commercial need to take someones likeness and generate video content?

If I were an A-list celebrity, I would give permission for Coke to make a commercial with my likeness, provided I was allowed final approval of the finished ad.

Do I have an avatar that attends my zoom work calls?

JamesBarney
3 replies
1d

Video games, entertainment, and avatars seem like the big ones.

HeatrayEnjoyer
2 replies
23h55m

If that is really the reason then this is insane and everyone involved should put their keyboards down and stop what they are doing.

This would be as if we invented and sold nuclear weapons to dig out quarry mines faster. The inconvenience it saves us quickly disappears into the overwhelming shadow of the enormous harm now enabled.

ImPostingOnHN
1 reply
23h44m

> This would be as if we invented and sold nuclear weapons to dig out quarry mines faster.

"Project Plowshare was the overall United States program for the development of techniques to use nuclear explosives for peaceful construction purposes."[0]

0: https://en.wikipedia.org/wiki/Project_Plowshare

wumeow
0 replies
23h40m

Yeah, and it was terminated. Much harder to put this genie back in the bottle.

bugglebeetle
1 reply
1d

State disinformation and propaganda campaigns.

NortySpock
0 replies
23h48m

Corporate disinformation and propaganda campaigns.

Personal disinformation and propaganda campaigns.

Oh Brave New World, that has such fake people in it!

szundi
0 replies
23h54m

Imagine being the CEO: you just grab your salary and options, go home, and sit in the hot tub while one of the interns carefully prompts GPT and VASA into giving your speech online about strategic directions. /s

r1chardnl
0 replies
1d

Apple Vision Pro personas competition

mensetmanusman
0 replies
1d

The purpose is to give remote workers the ability to clone themselves and automate their many jobs. /s

(but actually, because laziness is the driver of all innovation, I wouldn't be surprised if this happens).

jdietrich
0 replies
23h49m

In this case, replacing humans in service jobs. From the paper:

"Such technology holds the promise of enriching digital communication, increasing accessibility for those with communicative impairments, transforming education methods with interactive AI tutoring, and providing therapeutic support and social interaction in healthcare."

A convincing simulacrum of empathy could plausibly be the most profitable product since oil.

hypeatei
0 replies
1d

Entertainment maybe? I know that's not necessarily an ethical reason but some have made hilarious AI-generated songs already.

criddell
0 replies
23h44m

If beautiful people have an advantage in the job market, maybe people will use deepfake technology when doing zoom interviews? Maybe they will use it to alter their accent?

bonton89
0 replies
23h48m

Propaganda, political manipulation, narrative nudging, regular scams and advertising.

Even though most of those things are illegal, you could just have foreign cat's-paw firms do it. Maybe you fire them for "going too far" after the damage is done, assuming someone even manages to connect the dots.

SkyPuncher
0 replies
1d

On the surface, it's a simple, understandable demo for the masses. At the same time, it hints at deeper commercial usage.

Disney has been using digital likenesses to maintain characters whose actors/actresses have died. Princess Leia is the most prominent example. Arguably, there is significant real value in being able to generate a human-like character that doesn't have to be recast. That character can be any age, at any time, and look exactly like the actor/actress.

For actors/actresses, I suspect many of them will start licensing their image/likeness as they look to wind down their careers. It gives them ongoing income with very little effort.

FredPret
14 replies
23h52m

Anyone have any good ideas for how we're going to do politics now?

Today a big ML model can do this and it's somewhat regulatable; tomorrow people will be able to do this on their contact-lens supercomputers, and anyone will be able to generate a video of anything.

Is going back to personally knowing your local representative the only way? How will we vote for national candidates if nobody knows what they think or say?

dwb
4 replies
23h26m

We already rely on chains of trust going back to the original source, and we still will. I find these alarmist posts a bit mystifying: before photography, anyone could fake a quote of anyone, and human civilisation got quite far. We had a bit over a hundred years where photographic-quality images were possible and very hard to fake (which did and still does vary with technology), but clearly now we're past that. We'll manage!

woleium
0 replies
20h46m

The issue is better phrased as “how will we survive the transition while some folk still believe the video they are seeing is irrefutable proof the event happened?”

marcusverus
0 replies
22h58m

Presidential elections are frequently pretty close. Taking the electoral college into account (not the popular vote, which doesn't matter), Donald Trump won the 2016 election by a grand total of ~80,000 votes in three states[0].

Knowing that retractions rarely get viral exposure, it's not difficult to imagine that a few sufficiently viral videos could swing enough votes to impact a presidential election, especially considering that the average person is not up to speed on the current state of the tech, and so has not been prompted to build up the mindset required to fend off this new threat.

[0] https://www.washingtonpost.com/news/the-fix/wp/2016/12/01/do...

GeoAtreides
0 replies
20h9m

In the before times we didn't have social media, with its algorithms and reach. Does it matter that the chains of trust debunk a viral lie 24 hours after it has spread? Not that there's a lot of trust in the chains of trust to begin with. And if you still have trust, then you're not the target of the viral lie. And if you still have trust, how long can you hold on to that trust when the lies keep coming 24/7, one after another, without end? As one movie critic once put it: you might not have noticed it, but your brain did. Very malleable, this brain of ours.

Civilization might be fine, sure. Now, democracy, on the other hand...

BobaFloutist
0 replies
23h7m

Yeah I mean tabloids have been fooling people with doctored photos for decades.

Potentially we'll need slightly tighter regulations on the formal press (so that people who care about accurate information have a place they can get it), and we'll definitely want to steer the culture back towards holding them accountable for misinformation, but credulous people have always had easy access to bad information.

I'm much more worried about the potential abuse cases that involve ordinary people who aren't public figures and have much less ability to defend themselves. Heck, even celebrities are more vulnerable targets than politicians.

qup
1 reply
23h50m

People in my circles have been saying this for a few years now, and we've yet to see it happen.

I've got my popcorn ready.

But you can rest easy. Everyone just votes for the candidate their party picked, anyway.

FredPret
0 replies
23h44m

It'll happen; deepfakes just aren't good enough yet. But when they become ubiquitous and hard to spot, it'll be chaos until the average person is mentally inoculated against believing any video / anything on the internet.

I wonder if it's possible to digitally sign footage as it's captured? It'd be nice to have some shareable, demonstrably true media.

Edit: I'm a centrist and I definitely would lean one way or the other based on who the options are (or who I think they are).

cchance
0 replies
22h1m

Didn't see that one. Pretty cool; not as good as EMO or VASA, but pretty good.

kmlx
0 replies
23h40m

> How will we vote for national candidates if nobody knows what they think or say?

i’m going to burst your bubble here, but most voters have no idea about policies or candidates. most voters vote based on inertia or minimal cues, not on policies or candidates.

i suggest you look up “The American Voter”, “The Democratic Dilemma: Can Citizens Learn What They Need to Know?” and “American National Election Studies”.

hx8
0 replies
23h45m

Hyper targeted placement of generated content designed to entice you to donate to political campaigns and to vote. Perhaps leading to a point where entire video clips are generated for a single viewer. Politicians and political commentators will lease their likeness and voice out for targeted messaging to be generated using their likeness. Less reputable platforms will allow disinformation campaigns to spread.

hooverd
0 replies
21h12m

People already believe any quote you slap on a JPEG.

TimedToasts
0 replies
21h4m

> Anyone have any good ideas for how we're going to do politics now?

If a business is showing a demo of this, you can be assured that the government has already had this tech for some time.

> How will we vote for national candidates if nobody knows what they think or say?

You don't know what they think or say now; hopefully this disabuses people of that notion.

4ndrewl
0 replies
23h45m

DNS? Might be that we need a radical (for some) change of viewpoint.

Just as there's no privacy on the internet, how about "there's very little trust on the internet"? Assume everything not securely signed by a trusted party is false.

alfalfasprout
13 replies
1d

What this is starting to reveal is that there's a clear need for some kind of chain-of-custody system that guarantees the authenticity of what we see. Nikon/Canon tried doing this in the past, but improper storage of private keys led to vulnerabilities. As far as I'm aware, it never extended to video either.

With modern secure hardware keys it may yet be possible. The difficulty is that any kind of photo/video manipulation would break the signature (and there are obviously practical reasons to want to be able to edit videos).

In an ideal world, any mutation of the source content would be traceable back to the original. But that's not an easy problem to solve.
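
A minimal sketch of what the signing half could look like, assuming an Ed25519 device key; the Python "cryptography" package and all names here are illustrative, not what any camera vendor actually ships:

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # In a real camera this key would live in tamper-resistant hardware;
    # generating it here is purely for illustration.
    device_key = Ed25519PrivateKey.generate()
    device_pub = device_key.public_key()

    frame = b"raw sensor bytes for one captured image"
    signature = device_key.sign(frame)  # shipped alongside the file

    # Anyone holding the manufacturer-published public key can check that
    # the bytes are exactly what the sensor produced.
    try:
        device_pub.verify(signature, frame)
        print("authentic")
    except InvalidSignature:
        print("altered, or not from this device")

Even a benign crop or re-encode changes the bytes and voids the signature, which is exactly the editing problem above.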

throw__away7391
8 replies
1d

No, we are merely returning to the pre-photography state of things where a mere printed image is not sufficient evidence for anything.

hx8
3 replies
23h51m

True, an image, audio clip, or video is not enough evidence to establish truth.

We still need a way to establish truth. It's important for security cameras, for politics, and for public figures. Here are some things we could start looking into.

* Cameras that sign their output (see the sketch after this list). Yes, this camera caught this video, and it hasn't been modified. This is a must for recordings used as court evidence, IMO. Otherwise framing a crime is as easy as a few deepfakes and planting some DNA or fingerprints at the scene of the crime.

* People digitally signing pictures/audio/videos of themselves. Even if the data has been digitally modified, the signature shows they consent to having their image associated with that message. It reduces the strength of deepfake videos as an attack vector for reputation sabotage.

* Malicious content source detection and flagging. Think email spam filter type tagging of fake content. Community notes on X would be another good example.

* Digital manipulation detection. I'm less than hopeful this will be the way in the long term, but it could be used to disprove some fraud.
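
One way the camera-signing idea could extend to video (my speculation, not a description of any existing product) is to hash-chain the frames, so a single signature over the final digest commits to every frame and to their order:

    import hashlib

    def chain_digest(frames):
        """Fold each frame into a running SHA-256 hash; the final digest
        commits to every frame and to their exact order."""
        digest = b"\x00" * 32  # fixed genesis value
        for frame in frames:
            digest = hashlib.sha256(digest + frame).digest()
        return digest

    # The camera signs chain_digest(frames) once at the end of recording;
    # a verifier recomputes the chain and checks that single signature.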

alchemist1e9
1 reply
23h36m

Blockchains can be used for cryptographic timestamping.

I've always had a suspicion that governments and large companies would prefer a world without hard cryptographic proofs. After WikiLeaks, they noticed DKIM can cause them major blowback. Somehow the general public isn't aware that the emails were proven authentic by their DKIM signatures; even in fairly educated circles people believe the emails "were fake", but that isn't actually possible.
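
For the timestamping half, what gets anchored on a blockchain (or sent to a notary service) is just a digest of the document; a minimal sketch, with a hypothetical file name:

    import hashlib

    def commitment(path):
        """Hash a file; publishing this digest later proves the file
        existed in exactly this form at publication time."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # commitment("leaked_emails.mbox")  # hypothetical input file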

PeterisP
0 replies
12h41m

Quite the opposite: governments and large companies even explicitly run services for digital timestamping of documents. If I wanted to potentially assert some facts in court, I'd definitely prefer having that e-document timestamped and notarized by my local government service instead of by Bitcoin, because while the cryptography is the same, it would be much simpler from a practical legal perspective, requiring less time, effort, and cost to get the court to accept it.

alex_suzuki
0 replies
23h34m

Signing is great, but the hard part is managing keys and trust.

tass
1 reply
1d

There goes the dashcam industry…

barbazoo
0 replies
1d

You're being downvoted, but I think the comment raises a good question: what will happen when someone gets accused of doctoring their dashcam footage? Or any footage used for evidence.

anigbrowl
1 reply
21h15m

> merely

You say this as if it were not a big deal, but losing a century's worth of authentication infrastructure/practices is a Bad Thing which will have large negative externalities.

throw__away7391
0 replies
6h0m

It isn't really, though. It has been technically possible to convincingly doctor photos for decades, and it has been gradually getting easier, cheaper, and faster; even now the currently available tech has limitations, and the full change is not going to happen overnight.

bonton89
1 reply
23h56m

I expect this type of system to be implemented in my lifetime. It will allow whistleblowers and investigative sources to be discredited or tracked down and persecuted.

20after4
0 replies
7h8m

Unfortunately that seems inevitable.

PeterisP
1 reply
12h52m

I think it's practically impossible for such a system to be globally trustworthy due to the practical inevitability of "improper storage of private keys led to vulnerabilities" scenarios.

People will expect or require that chain of custody only if all or at least the vast majority of the content they want would have that chain of custody.

Photo/video content will have that chain of custody only if all or almost all of devices recording that content will support it - including all the cheapest mass-produced devices in reasonably widespread use anywhere in the world.

And that chain of custody provides the benefit only if literally 100% of these manufacturers have their private keys secure 100% of the time, which is simply not happening; at least one such key will leak, if not unintentionally then intentionally for some intelligence agency who wants to fake content.

And what do you do once you see a leak of the private keys used for signing the certificates for the private keys securely embedded in (for example) all of 2029 Huawei smartphones, which could be like 200 million phones? The users won't replace their phones just because of that, and you'll have all these users making content - so everyone will have to choose to either auto-block and discard everything from all those 200 million users, or permit content with a potentially fake chain of custody; and I'm totally certain that most people will prefer the latter.

macrolime
0 replies
5h15m

Multisig by the user and camera manufacturer can help to some extent.

cs702
10 replies
23h49m

And it's only going to get faster, better, easier, cheaper.[a]

Meanwhile, yesterday my credit card company asked me if I wanted to use voice authentication for verifying my identity "more securely" on the phone. Surely the company spent many millions of dollars to enable this new security-theater feature.

It raises the question: Is every single executive and manager at my credit card company completely unaware that right now anyone can clone anyone else's voice by obtaining a short sample audio clip taken from any social network? If anyone is aware, why is the company acting like this?

Corporate America is so far behind the times it's not even funny.

---

[a] With apologies to Daft Punk.

user_7832
3 replies
23h28m

> Is every single executive and manager at my credit card company completely unaware that right now anyone can clone anyone else's voice by obtaining a short sample audio clip taken from any social network?

Your mistake is assuming the company cares. The "company" is a hundred different disjointed departments that only care about not getting caught Equifax-style (or filing for bankruptcy if caught). If the marketing director sees a shiny new thing that might boost some random KPI, they may not really care about security.

However in the rare chance that your bank is actually half decent, I'd suggest contacting their IT/Security teams about your concerns. Maybe you'll save some folks from getting scammed?

cyanydeez
2 replies
22h16m

Also, this feature is probably just some mid-level exec's plan for a bonus, not something rigorously reviewed and planned. It's also probably been in the pipeline for a decade, so if they don't push it out, suddenly there's no bonus, just a cancelled project.

Corporations are ultimately no better than governments and likely worse depending on what their regulatory environment looks like.

iamflimflam1
1 reply
12h8m

There’s a really important thing here for anyone trying to do sales to big companies.

Find an exec who needs a project to advance their career. Make your software that project.

Pull as many other execs into the project as you can, so that their careers become coupled to getting your software rolled out.

amindeed
0 replies
11h8m

That's clever!

ryandrake
1 reply
20h59m

Any time you add a "new" security gate to your product, it should be in addition to, not instead of, the existing gates. Biometrics should not replace username/password; they should be in addition to them. Security questions like "What was your first pet's name?" should not be able to get you in the back door. SMS verification alone should not allow you to reset your password. Same with this voice authentication stuff: it should be another layer, not a replacement for your actual credentials.

If you treat it as OR instead of AND, then your security is only as good as the worst link in the chain.
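
A toy sketch of the AND-versus-OR point; the factor results are made up for illustration:

    def authenticate_any(factors):
        """OR composition: security equals the weakest factor."""
        return any(check() for check in factors)

    def authenticate_all(factors):
        """AND composition: an attacker must defeat every factor."""
        return all(check() for check in factors)

    # An attacker armed only with a cloned voice: password and SMS checks
    # fail, but the voiceprint check is fooled.
    attacker = [lambda: False, lambda: False, lambda: True]
    print(authenticate_any(attacker))  # True: the clone alone gets in
    print(authenticate_all(attacker))  # False: the clone is one defeated layer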

recursive
0 replies
2h59m

If you make your product sufficiently inconvenient, then you'll have the unassailable security posture of having no users.

fragmede
1 reply
23h25m

I mean, what do you want them to do? Whether their security officers are freaking out and holding meetings right now about what to do, or they're asleep at the wheel, we'd be seeing the same thing from the outside, no?

addandsubtract
0 replies
17h11m

No, because multiple companies are pushing this atm. If it were only one company I would agree, but with multiple, you'd have at least one that would back out of it again.

dade_
0 replies
4h22m

Yes, they are aware, and they also know it isn't foolproof, so that isn't the only information being compared against. Some services compare the calling number against live activity on the PSTN (the subscriber's phone not being in an active call while their number is presented as the caller ID is one such signal). Many of the publicly accessible deepfake generators put watermarks in the audio. The audio stream comparison goes further: it needs to speak like you, down to word and phrase choices. There are other fingerprints of generated audio that you can't hear but that are still obvious, at least for the moment. With security, it's always cat and mouse with fraudsters on one hand and effort and frustration for customers on the other.

Asking customers questions that they don't remember and that fraudsters have in front of them isn't working, and the time it takes for agents to authenticate is very expensive.

While there is no doubt that companies will screw up with security, you are making wild accusations without reference to any evidence.

nycdatasci
7 replies
2d7h

“We have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.”

justinclift
3 replies
2d3h

> until we are certain that the technology will be used responsibly ...

That's basically "never" then, so we'll see how long they hold out.

Scammers are apparently already using the existing voice/image/video generation fairly successfully. :(

spacemanspiff01
1 reply
1d16h

Having a delay, where people can see what's coming down the pipe, does have value. In a year there may (or will) be an open source model.

But knowing that this is possible is important.

I'm fairly clued in, and am constantly surprised at how fast things are changing.

justinclift
0 replies
1d16h

> But knowing that this is possible ...

Who is going to know this is possible?

The average elderly person isn't going to know any time soon. The SV IT people probably will.

It's not an even distribution of knowledge. ;/

ilaksh
0 replies
1d23h

Eventually someone will implement one of these really good recent ones as open source, and then it will be on Replicate etc. Right now the open source ones like SadTalker and Video Retalking are not live and are unconvincing.

sitzkrieg
0 replies
2d1h

money will change that

feyman_r
0 replies
2d3h

/s it doesn’t have the phrase LLM in the title

araes
0 replies
1d1h

Translation: "We're attempting to preserve our moat, and this is the correct PR blurb. We'll release an API once we're far enough ahead and extracted enough money."

Like somebody on Ars noted, "anybody notice it's an election year?" You don't need to release an API; all online videos are now of suspect authenticity. Somebody make a video of Trump or Biden's eyes following the mouse cursor around. Real videos turned into fake videos.

m3kw9
6 replies
1d

If you see talking heads with static/simple/blurred backgrounds from now on, assume they're fake. In the near future they will be accompanied by realistic backgrounds and even less detectable fakes; we will have to assume all vids could be faked.

hypeatei
2 replies
1d

I wonder how video evidence in court is going to be affected by this. Both from a defense and prosecution perspective.

Technically, videos could be faked before, but it required a ton of effort and skill that no average person had.

greenavocado
0 replies
1d

There will be a new cottage industry of AI detectives who serve as expert witnesses and attest to the originality of media before the court.

PeterisP
0 replies
12h47m

Just as before, a major part of photo or video evidence in court is not the actual video itself, but a person testifying "on that day I saw this horrible event, where these things happened, and here's attached evidence that I filmed which illustrates some details of what I saw." - which would be a valid consideration even without the photo/video, but the added details do obviously help.

Courts already wouldn't generally approve random footage without clear provenance.

Retric
2 replies
1d

I still find the faces themselves to be really obviously wrong. The sound is just off, close enough to tell who is being imitated but not particularly good.

tyingq
0 replies
7h27m

It's interesting to me that some of the long-standing things are still there. For example, lots of people with an earring in only one ear, unlikely asymmetry in the shape or size of their ears, etc.

tredre3
0 replies
1d

Especially the hair "physics" and sometimes the teeth shift around a bit.

But that's nitpicking. It's good enough to fool someone not watching too closely. And the fact that the result is this good with a single photo is truly astonishing; we used to have to train models on thousands of photos for days, only to end up with a worse result!

karaterobot
4 replies
23h59m

We need some clear legislation around this right now.

stronglikedan
1 replies
23h52m

counterpoint: we don't need any more legislation

qwertox
0 replies
23h20m

I tend towards agreeing with you. Many of the problematic uses, like impersonation, are already illegal.

And replacing a person who spreads lies, as can be seen in most TV or glossy-cover ads, shouldn't trigger some new legal action. The only difference is that now the actor is also a lie.

And countries which use actors or news anchors for spreading propaganda surely won't see an issue with replacing them with AI characters.

People who then get to read that their most favorite, stunningly beautiful Instagram or TikTok influencer is nothing but a fat, chips-eating, ugly person using AI may try to raise some legal issues to soothe their disappointment. They then might raise a point which sounds reasonable, but which would force politicians to also tackle the lies spread in TV/magazine ads.

Maybe clearly labeling any use of this tech, perhaps even with a QR code linking to the owner of the AI (similar to QR codes on meat packaging which let you track the origin of the meat), would be something laws could help with, in the spirit of transparency.

CamperBob2
0 replies
19h49m

Legislation only impairs the good guys.

4ndrewl
0 replies
23h48m

In which jurisdiction?

fluffet
3 replies
2d5h

This is absolutely crazy. And it'll only get better from here. Imagine "VASA-9" or whatever.

I thought deepfakes were still quite a bit away, but after this I will have to be way more careful online. It's not far from being something that can show up in your "YouTube Shorts" feed and trick you if you didn't already know it was AI.

vessenes
1 reply
1d20h

Hard disagree -- I think you might be misremembering how EMO looks in practice. I'm sure we'll learn VASA-1 "telltales", but to my eyes there are far fewer than EMO's: zero of the EMO videos were "perfect" for me, and many show little glitches or missing sync. VASA-1 still blinks a bit more than I think is natural, but it looks much more fluid.

Both are, BTW, AMAZING!! Pretty crazy.

smusamashah
0 replies
1d6h

In VASA there is way too much body movement, not just of the head, as if the camera were moving in strong winds. EMO is a lot more human-like. In the very first video on the EMO page I still cannot see it as a generated video; it's that real. The lip movement and the expressions are almost in perfect sync with the voice. That is absolutely not the case with VASA.

physhster
2 replies
1d

A fantastic technological advance for election interference!

RGamma
0 replies
23h59m

Such an exciting startup idea! I'm thrilled!

IshKebab
0 replies
7h14m

As if this technology was needed.

gedy
2 replies
2d4h

My first thought was "oh no the interview fakes", but then I realized - what if they just kept using the face? Would I care?

acidburnNSA
0 replies
1d12h

Yeah, even if they just use LLMs to do all the work, or are an LLM themselves, as long as they can do the work I guess.

Weird implications for various regulations though.

PeterisP
0 replies
12h37m

It would be interesting: a remote candidate could easily present as whatever ethnicity, age, or even gender they consider most beneficial for hiring, to avoid discrimination or fit certain diversity incentives.

Tech like this has the potential to bring us back to the days of "on the Internet, nobody knows you're a dog": https://en.wikipedia.org/wiki/On_the_Internet,_nobody_knows_...

IshKebab
2 replies
1d23h

Oh god don't watch their teeth! Proper creepy.

Still, apart from the teeth this looks extremely convincing!

ygjb
0 replies
1d22h

Yeah, the teeth, the tongue movement, the lack of tongue shape, and the "stretching" of the skin around the cheeks pushed the videos right into the uncanny valley for me.

mtremsal
0 replies
5h27m

The teeth resizing dynamically is incredibly distracting, or, more positively, a nice way to identify fakes. For now.

smusamashah
1 reply
22h21m

This is good, but nowhere near as good as EMO https://humanaigc.github.io/emote-portrait-alive/ (https://news.ycombinator.com/item?id=39533326)

This one has too much fake-looking body movement and looks eerie/robotic/uncanny-valley. The lips don't sync properly in many places. Eye movement and overall head and body movement are not very natural at all.

EMO, meanwhile, mostly looks just perfect. The first two videos on the EMO page are a perfect example of that. See the rap near the end to see how good EMO is at lip sync.

cchance
0 replies
22h3m

Another research project with 0 model release

pxoe
1 reply
2d2h

maybe making a webpage with 27 videos isn't the greatest web design idea

sitzkrieg
0 replies
2d1h

the two busted scrolling sections on mobile really don't help

mdrzn
1 reply
2d10h

Holy shit, these are really high quality and basically in real time on a 4090. What a time to be alive.

rbinv
0 replies
2d9h

It really is something. 40 FPS on a 4090, damn.

jazzyjackson
1 reply
1d13h

i get why this is interesting but why is it desirable?

real jurassic park "too preoccupied with whether they could" vibes

acidburnNSA
0 replies
1d12h

Now I can join the meeting "in a suit" while being out paddleboarding!

SirMaster
1 reply
23h51m

It looks all warpy and stretchy. That's not how skin and face muscles work. Looks fake to me.

Zopieux
0 replies
4h25m

I find the hair to be the least realistic: it looks elastic, which is unsurprising; highly detailed things like hair are hard to simulate with good fidelity.

thih9
0 replies
1h22m

I could see this being used in movie production.

qwertox
0 replies
2d6h

So an ugly person will be able to present his or her ideas on the same visual level as a beautiful person. Is this some sort of democratization?

nojvek
0 replies
2d8h

I like the considerations section.

There's likely also an unsaid statement: this is for us only, and we'll be the only ones making money from it, with our definition of "safety" and "positive".

metalspoon
0 replies
16h32m

AI can talk with me. Why would I need a friend in real life?

ilaksh
0 replies
1d23h

The paper mentions it uses Diffusion Transformers. The open source implementation that comes up in Google is Facebook Research's PyTorch implementation, which is under a non-commercial license. https://github.com/facebookresearch/DiT

Is there something equivalent under MIT or Apache?

I feel like diffusion transformers are key now.

I wonder if OpenAI implemented their Sora stuff from scratch or if they built on the Facebook Research diffusion transformers library. It would be interesting if they violated the non-commercial part.

Hm. Found one: https://github.com/milmor/diffusion-transformer-keras
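
For anyone wondering what "diffusion transformer" means concretely: it's an ordinary transformer block whose LayerNorms are modulated by the diffusion-step conditioning (the adaLN scheme from the DiT paper). A minimal PyTorch sketch, simplified from the paper (no zero-init, a single linear layer for the modulation) and with illustrative sizes:

    import torch
    import torch.nn as nn

    class DiTBlock(nn.Module):
        """Transformer block with adaLN conditioning, in the spirit of DiT."""
        def __init__(self, dim, heads):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            # The conditioning vector regresses per-block scales, shifts, gates.
            self.ada = nn.Linear(dim, 6 * dim)

        def forward(self, x, cond):
            s1, b1, g1, s2, b2, g2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
            h = self.norm1(x) * (1 + s1) + b1
            x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
            h = self.norm2(x) * (1 + s2) + b2
            return x + g2 * self.mlp(h)

    block = DiTBlock(dim=384, heads=6)
    tokens = torch.randn(2, 256, 384)  # batch of latent-patch tokens
    cond = torch.randn(2, 384)         # timestep/conditioning embedding
    out = block(tokens, cond)          # same shape as tokens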

gavi
0 replies
2d7h

The GPU requirements for real-time video generation are very minimal in the grand scheme of things. An assault on reality itself.

fullstackchris
0 replies
2d5h

lol how does something like this get only 50ish votes but some hallucinating video slop generator from some of the other competitors gets thousands?

egberts1
0 replies
23h12m

Cool! Now we can expect to see an endless stream of dead presidents' speeches "LIVE" from the White House.

This should end well.

andrewstuart
0 replies
23h44m

Despite vast investment in AI by VCs and vast numbers of startups in the field, these sorts of things remain unavailable as simple consumer-installable software.

Every second day HN has some post about some amazing new AI system. Never available to download, run, and use.

Why the vast investment and no startup selling consumer downloadable software to do it?

acidburnNSA
0 replies
2d14h

Oh no. "Cameras on please!" will be replaced by "AI generated faces off please!" in teams.

RcouF1uZ4gsC
0 replies
20h49m

> To show off the model, Microsoft created a VASA-1 research page featuring many sample videos of the tool in action

With AI stuff, I have learned to be very skeptical until and unless a relatively publicly accessible demo with user-specified inputs is available.

It is way too easy for humans to cherry-pick the nice outputs, or to take advantage of biases in the training data to generate nice outputs, and that is not at all reflective of how it holds up in the real world.

Part of the reason ChatGPT, Stable Diffusion, and DALL-E had such an impact is that people could try them and see for themselves, without being told how awesome they were by the people making them.

BobaFloutist
0 replies
23h11m

Oh good!