This AI Clones Your Voice After Listening for 5 Seconds

Aiursrage2k · Nov 14, 2019

Incredible. Imagine being able to get voice actors for your game using just 5 seconds of audio.

https://google.github.io/tacotron/publications/speaker_adaptation/

neoshaman · Nov 14, 2019

It doesn't do emotional intonation, so no. You'd need a data set with labelled acting performances, and on top of that you'd need a voice director's annotations to give broader context, because there are hundreds of ways to express the same emotion, so choosing the one that fits the broader narrative context is key.

So you see, I have no doubt AI can do it. The problem is that you first have to break down voice direction and text analysis into a coherent semantic notation, so you can pass that context to the AI to imitate. Given that's a field that relies a lot on intuition (we just live through emotions, we don't need to understand them mechanically), good luck finding that domain definition and using AI on it.

It's not impossible, it's just a vast field of research for you to tackle, just so you can have some voices in your game.

A style-transfer type of technique will probably be more useful in the short term, i.e. a single person performs all the voices and AI transfers each one to a different voice. Think autotune for voice acting.

Murgilod · Nov 14, 2019

I could see it being useful for computer narration that doesn't rely on emotional intonation, like if you had a ship computer that had to talk to the crew or something.

sxa · Nov 14, 2019

neoshaman said: ↑

So you see, I have no doubt AI can do it. The problem is that you first have to break down voice direction and text analysis into a coherent semantic notation, so you can pass that context to the AI to imitate.

Someone with experience of something like Vocaloids etc. shouldn't have too much difficulty notating it though; they'd already be familiar with doing it (i.e. via a musical DAW's track editor).
That'd be the easy/sane basis for notating it, btw. Intonation maps fairly well onto pitch/volume/duration.
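
A minimal sketch of what such a DAW-style notation could look like as data, assuming one note per syllable. The `ProsodyNote` class, its field names and the example values are all hypothetical; this is not how Vocaloid or any particular editor actually stores a track.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProsodyNote:
    """One syllable on a piano-roll-style track (hypothetical notation)."""
    syllable: str
    pitch_hz: float    # target fundamental frequency
    volume: float      # 0.0 (silent) to 1.0 (full)
    duration_ms: int   # how long the syllable is held


def total_duration_ms(line: List[ProsodyNote]) -> int:
    """Length of the whole line, summed over its notes."""
    return sum(note.duration_ms for note in line)


def pitch_contour(line: List[ProsodyNote]) -> str:
    """Classify the line as rising, falling or flat from first to last note."""
    delta = line[-1].pitch_hz - line[0].pitch_hz
    if delta > 0:
        return "rising"
    if delta < 0:
        return "falling"
    return "flat"


# "Really?" read as a question: pitch and volume go up on the last syllable.
question = [
    ProsodyNote("real", 180.0, 0.7, 220),
    ProsodyNote("ly", 240.0, 0.8, 300),
]
```

The same two syllables with a falling pitch would read as a flat statement, which is the kind of distinction this notation can carry. Whether it can carry everything a voice director would want is a separate question.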

frosted · Nov 14, 2019

I was thinking about giving this a spin a few weeks ago.

Being able to insert player names (and other runtime generated stuff) into speech can really do a lot for some games. Maybe not ready yet, but soon.

neoshaman · Nov 14, 2019

frosted said: ↑

Being able to insert player names (and other runtime generated stuff) into speech can really do a lot for some games. Maybe not ready yet, but soon.

This is actually the better use case, it seems, since it would simply replace the content and not change the semantics, assuming you can train the NN with enough samples and the correct architecture.
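
A rough sketch of the bookkeeping side of that idea, assuming dialogue lines are written as templates with `{slot}` placeholders. The `plan_line` function, the segment labels and the example names are made up for illustration; the actual speech synthesis is not shown.

```python
import re
from typing import Dict, List, Tuple

SLOT = re.compile(r"\{(\w+)\}")


def plan_line(template: str, values: Dict[str, str]) -> List[Tuple[str, str]]:
    """Split a template into ('recorded', text) and ('synthesized', text) parts.

    Static text can come from a normal recording; only the slot values
    would have to be generated at runtime in the actor's cloned voice.
    """
    segments = []
    pos = 0
    for match in SLOT.finditer(template):
        static = template[pos:match.start()]
        if static:
            segments.append(("recorded", static))
        segments.append(("synthesized", values[match.group(1)]))
        pos = match.end()
    tail = template[pos:]
    if tail:
        segments.append(("recorded", tail))
    return segments


plan = plan_line(
    "Welcome back, {player}. The {ship} is ready.",
    {"player": "Alex", "ship": "Aurora"},
)
```

Here `plan` marks only "Alex" and "Aurora" as needing synthesis. Making the joins between recorded and synthesized audio sound natural is the hard part and is outside this sketch.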

sxa said: ↑

Someone with experience of something like Vocaloids etc. shouldn't have too much difficulty notating it though; they'd already be familiar with doing it (i.e. via a musical DAW's track editor).
That'd be the easy/sane basis for notating it, btw. Intonation maps fairly well onto pitch/volume/duration.

Nah, Vocaloids are for music, not for acting. There is much more to acting than pitch and all of that.

That said, mediocre acting, i.e. good enough, is probably possible, so maybe. Infinite mediocre acting that does the job is still better than no acting or typical text-to-speech. I mean, we have to live with so many high-profile games having mediocre French dubs...

Also, it could be enough of a proof of concept to get good people to start thinking about that use case, mmmmm

It's funny, because this and facial performance were the big things that separated indie from AAA.

Murgilod · Nov 14, 2019

frosted said: ↑

I was thinking about giving this a spin a few weeks ago.

Being able to insert player names (and other runtime generated stuff) into speech can really do a lot for some games. Maybe not ready yet, but soon.

...actually, that player name integration idea without having to record constant variations like in fo4 is a great use for this. It'd take some work to make it flow properly, but it's far more doable than going full vocaloid.

Aiursrage2k · Nov 14, 2019

I assume they will eventually add tone. It seems like they have too little data, but things like this will come in the future.
<Sad>It was the saddest day of my life</Sad><Angry>yet the most hilarious</Angry>
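
A small sketch of how markup like that could be turned into (emotion, text) segments before handing them to a synthesizer. It assumes flat, non-nested tags matched case-insensitively; the tag names and `parse_emotion_markup` are invented here and do not belong to any existing TTS system.

```python
import re
from typing import List, Tuple

# <Tag>text</Tag>; the closing tag must repeat the opening name (any case).
TAGGED = re.compile(r"<(\w+)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def parse_emotion_markup(text: str) -> List[Tuple[str, str]]:
    """Return (emotion, text) pairs in reading order; emotion is lowercased."""
    return [(m.group(1).lower(), m.group(2).strip()) for m in TAGGED.finditer(text)]


segments = parse_emotion_markup(
    "<Sad>It was the saddest day of my life</Sad>"
    "<Angry>yet the most hilarious</Angry>"
)
```

Each segment could then be synthesized with its own tone setting, if and when a model supports one.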

The same thing for art assets. It will just be a matter of time until

AndersMalmgren · Nov 15, 2019

Artists and taxi drivers, no one is safe from automation.
Lucky for us devs that devs are needed for automation

Billy4184 · Nov 15, 2019

What would be good is if it could take an 'amateur' voice and polish it up to sound more professional, including the emotion and intonation.

All these procedural generation type tools don't really excite me unless they have the ability to convert low-medium quality input to high quality output. That's the way to keep the signature of a game intact when you are generating things, and to be able to manage the fundamental characteristics of the result.

Aiursrage2k · Nov 15, 2019

They are making great strides in AI. I say give it a decade or so and we will be able to use these systems as input to create user content on the fly, personalized to the individual playing.

-> Feel like playing a "horror game" set on a "titanic" in "space" where the monster chasing you is a "zombie cheerleader"? Then the system auto-generates the level and streams it down to you in seconds over 5G internet.

neoshaman · Nov 15, 2019

It's not so easy, because the problem isn't AI, it's our understanding of the problem. You can't set up an AI to automate a problem you don't understand.

Billy4184 said: ↑

unless they have the ability to convert low-medium quality input to high quality output.

Depends on what you mean by quality:
1. sampling quality
Well, that's the go-to NN stuff: super resolution. Creating the data to train a network is also super easy: you take high quality audio, then generate low quality versions of various types from it. The whole problem is figuring out the proper architecture, which means you will have to read all the papers on the matter. Also, data is generally indiscriminate, so architectures for images would work (and have worked) for audio.

2. (some) acting (kinda) quality
On top of what I said above, you can try to generate data: you pick high quality samples, then have an amateur say the same thing without listening to the original sample. Amateurs being easier to find, you only need to source the high quality data (game audio dialogue dumps? audiobooks?). Then train a network by finding the proper architecture. That's style transfer, one of the early applications.

Problem:
Audio is a lot of raw data, more than images, and it is sensitive to temporal consistency. There is potential to accelerate training by using a low sample rate for the style transfer training, then training another network to upscale the low sample rate output to higher quality.
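
To make the "generate low quality from high quality" step concrete, here is a minimal sketch that builds one (degraded input, clean target) training pair. It assumes audio is just a list of floats in [-1, 1] and uses a synthetic sine wave instead of real recordings; the function names are mine, and the naive decimation has no anti-alias filter, which a real pipeline would want.

```python
import math
from typing import List, Tuple


def sine_wave(freq_hz: float, sample_rate: int, n_samples: int) -> List[float]:
    """Stand-in for a clean, high quality recording."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n_samples)]


def decimate(samples: List[float], factor: int) -> List[float]:
    """Naive downsampling: keep every `factor`-th sample (no anti-alias filter)."""
    return samples[::factor]


def quantize(samples: List[float], bits: int) -> List[float]:
    """Snap amplitudes to a coarse grid of step 1 / 2**(bits - 1)."""
    levels = 2 ** (bits - 1)
    return [round(s * levels) / levels for s in samples]


def make_pair(clean: List[float], factor: int, bits: int) -> Tuple[List[float], List[float]]:
    """Return (low_quality_input, high_quality_target) for supervised training."""
    return quantize(decimate(clean, factor), bits), clean


clean = sine_wave(440.0, 16000, 160)          # 10 ms at 16 kHz
degraded, target = make_pair(clean, factor=4, bits=4)
```

Varying `factor` and `bits` gives the "various types" of low quality input from a single clean source, so the clean data is the only thing you have to collect.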

Billy4184 · Nov 15, 2019

neoshaman said: ↑

Depends on what you mean by quality:

That's the million dollar question with procedural generation. At the moment I see a lot of stuff that is designed to generate 'realistic' content, as if that is the real metric of quality. But what about content that emerges from stylistic rules? What rules? Are there even rules to what makes something charming? There must be, or everything or nothing would be charming.

If you know the rules that govern beauty, you can turn a scene that is mediocre into something beautiful. You can take intent and guide it toward perfect realization.

But instead it's usually just cloning realism, the thing we probably started playing games to escape anyway.

That's why, unless you have no specific purpose to begin with (which I suppose is often the case), procedural generation is more of a hindrance than a help. Unless you designed it yourself of course, tailored to your purpose.

iamthwee · Nov 16, 2019

I use Google to book, make and take appointments.

iamthwee · Nov 16, 2019

AndersMalmgren said: ↑

Artists and taxi drivers, no one is safe from automation.
Lucky for us devs that devs are needed for automation

Seems like this post was automated

neoshaman · Nov 16, 2019

Billy4184 said: ↑

That's the million dollar question with procedural generation. At the moment I see a lot of stuff that is designed to generate 'realistic' content, as if that is the real metric of quality. But what about content that emerges from stylistic rules? What rules? Are there even rules to what makes something charming? There must be, or everything or nothing would be charming.

If you know the rules that govern beauty, you can turn a scene that is mediocre into something beautiful. You can take intent and guide it toward perfect realization.

But instead it's usually just cloning realism, the thing we probably started playing games to escape anyway.

That's why, unless you have no specific purpose to begin with (which I suppose is often the case), procedural generation is more of a hindrance than a help. Unless you designed it yourself of course, tailored to your purpose.

I dunno, it seems that you have had narrow exposure to that stuff, because there is a hell of a lot of non-realistic stuff.

HOWEVER, we are mostly talking (implicitly) about neural networks in this thread, so I'll stick to that.

Style transfer is about painterly style.

It can learn to generate stylized characters just fine (here anime characters, because of course weebs would jump on that): https://twitter.com/gwern/status/1095131651246575616?lang=eu

There are neural networks that are able to classify images based on beauty (defined by the input dataset) or to find artistic innovation in paintings (based on a data set and clustering).

You can bake your own interpretation of sensitivity into a neural network by simply assigning scores to a data set.

And don't forget you can chain neural networks together to produce a result.

Most stuff is realistic because that's seen as an easy validation step due to the complexity of reality. If it can handle reality, what couldn't it handle?

Anyway, there is also non-neural-network artistic PCG stuff.

@Billy4184 In your case, you can probably show examples of good stuff and bad stuff, and teach the network to discriminate between the two. Now, like I said, it's less a problem of the network and more one of discovering how solid your own definition is, as that can impact the quality of the learning, especially if there is hidden contextual data. The thing is, you can try.
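
As a toy version of "show good and bad examples and learn to tell them apart", here is a minimal sketch using a single logistic unit as a stand-in for a real network. The two features per example and their values are invented for illustration; in practice they would come from your own assets and your own good/bad labels.

```python
import math
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Probability that example `x` is 'good' according to the current weights."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(
    examples: List[Sequence[float]],
    labels: List[int],
    epochs: int = 500,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit weights and bias with plain per-example gradient descent on log loss."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            error = predict(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical hand-scored data: label 1 = "good", label 0 = "bad".
good = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.95)]
bad = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.15)]
weights, bias = train_logistic(list(good) + list(bad), [1, 1, 1, 0, 0, 0])
```

If your own labelling is inconsistent, the model has nothing stable to learn, which is the "how solid is your definition" problem in miniature.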
