This AI Clones Your Voice After Listening for 5 Seconds

Discussion in 'General Discussion' started by Aiursrage2k, Nov 14, 2019.

  1. Aiursrage2k

    Aiursrage2k

    Joined:
    Nov 1, 2009
    Posts:
    4,834
    Rodolfo-Rubens likes this.
  2. neoshaman

    neoshaman

    Joined:
    Feb 11, 2011
    Posts:
    5,527
    It doesn't do emotional intonation, so no. You need a data set with labelled acting performances, and on top of that you need voice director annotations to give broader context, because there can be hundreds of ways to express the same emotion, so choosing the one that fits the broader narrative context is key.

    So you see, I have no doubt AI can do it. The problem is that you first have to break down voice direction and text analysis into a coherent semantic notation, so you can pass that context to the AI to imitate. Given that it's a field that relies a lot on intuition (we just live through emotions, we don't need to understand them mechanically), good luck finding that domain definition and using AI on it.

    It's not impossible, it's just a vast field of research for you to tackle, just so you can have some voice in your game.

    Probably a style-transfer type of technique will be more useful in the short term, i.e. a single person does all the voices themselves, and AI transfers them to different voices. Think autotune for voice acting.
     
    SparrowsNest likes this.
  3. Murgilod

    Murgilod

    Joined:
    Nov 12, 2013
    Posts:
    7,163
    I could see it being useful for computer narration that doesn't rely on emotional intonation, like if you had a ship computer that had to talk to the crew or something.
     
    angrypenguin and Martin_H like this.
  4. sxa

    sxa

    Joined:
    Aug 8, 2014
    Posts:
    451
    Someone with experience of something like Vocaloid shouldn't have too much difficulty notating it though; they'd already be familiar with doing it (i.e. via a musical DAW's track editor).
    That'd be the easy/sane basis for notating it, btw. Intonation maps fairly well to pitch/volume/duration.
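
    To make the pitch/volume/duration idea concrete, here is a minimal sketch of a piano-roll style notation for one spoken line. The names `ProsodyNote`, `total_duration` and `pitch_contour` are hypothetical, made up for illustration; they are not part of any Vocaloid or DAW API, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class ProsodyNote:
    """One syllable annotated the way a track editor annotates a note."""
    syllable: str
    pitch_hz: float    # target fundamental frequency
    volume: float      # 0.0 (silent) to 1.0 (full)
    duration_s: float  # how long the syllable is held, in seconds


def total_duration(notes):
    """Sum the durations of all notes in the line."""
    return sum(n.duration_s for n in notes)


def pitch_contour(notes):
    """Return 'rising', 'falling' or 'flat' from the first and last pitch."""
    if not notes or notes[0].pitch_hz == notes[-1].pitch_hz:
        return "flat"
    return "rising" if notes[-1].pitch_hz > notes[0].pitch_hz else "falling"


# "Are you sure?" with the pitch climbing toward the end, as in a question.
line = [
    ProsodyNote("are", 180.0, 0.6, 0.15),
    ProsodyNote("you", 190.0, 0.7, 0.15),
    ProsodyNote("sure", 240.0, 0.9, 0.40),
]
```

    A notation like this only captures the measurable part of intonation; whether it captures enough of the acting is exactly what is being debated in this thread.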
     
  5. frosted

    frosted

    Joined:
    Jan 17, 2014
    Posts:
    3,822
    I was thinking about giving this a spin a few weeks ago.

    Being able to insert player names (and other runtime generated stuff) into speech can really do a lot for some games. Maybe not ready yet, but soon.
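
    A rough sketch of how that could be wired up, assuming each dialogue line is stored as a template and only the `{slot}` parts are handed to a voice-cloning model at runtime. `build_playlist` and the "recorded"/"synthesized" labels are hypothetical; no real TTS library is called here.

```python
import string


def build_playlist(template, values):
    """Split a line like 'Welcome back, {player}!' into ordered clips.

    Literal text maps to a pre-recorded clip; each {slot} is marked
    for runtime synthesis in the cloned voice.
    """
    clips = []
    for literal, field, _spec, _conv in string.Formatter().parse(template):
        # Skip fragments that contain nothing speakable (e.g. a lone "!").
        if any(ch.isalnum() for ch in literal):
            clips.append(("recorded", literal.strip()))
        if field is not None:
            clips.append(("synthesized", str(values[field])))
    return clips


clips = build_playlist(
    "Welcome back, {player}, your ship is ready.",
    {"player": "Nova"},
)
```

    Making the synthesized name flow with the recorded clips on either side (matching pitch and pacing at the seams) is the hard part and is not addressed by this sketch.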
     
    neoshaman and Ryiah like this.
  6. neoshaman

    neoshaman

    Joined:
    Feb 11, 2011
    Posts:
    5,527
    This is actually the better use case, it seems, since it would simply replace the content and not change the semantics, assuming you can train the NN with enough samples and the correct architecture.

    Nah, Vocaloids are for music, not for acting; there is much more to acting than pitch and all of that.

    That said, mediocre acting, i.e. good enough, is probably possible, so maybe. Infinite mediocre acting that does the job is still better than no acting or typical text to speech. I mean, we have to live with so many high-profile games having mediocre French dubs ...

    Also it could be enough of a proof of concept to make good people start thinking about that use case, mmmmm

    It's funny because this and facial performance were the big things that separated indie from AAA.
     
  7. Murgilod

    Murgilod

    Joined:
    Nov 12, 2013
    Posts:
    7,163
    ...actually, that player name integration idea without having to record constant variations like in fo4 is a great use for this. It'd take some work to make it flow properly, but it's far more doable than going full vocaloid.
     
    neoshaman likes this.
  8. Aiursrage2k

    Aiursrage2k

    Joined:
    Nov 1, 2009
    Posts:
    4,834
    I assume they will eventually add tone. It seems like they have too little data for now, but things like this will come in the future.
    <Sad>It was the saddest day of my life</Sad><Angry>yet the most hilarious</Angry>

    The same thing for art assets. It will just be a matter of time until
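
    The tone markup imagined earlier in this post (emotion tags wrapped around text) could be split into per-emotion segments with something like the sketch below. The tag format is the one from this post, not an existing standard, and `parse_emotion_markup` is a hypothetical helper. Closing tags are matched case-insensitively, so `<Sad>...</sad>` is accepted too.

```python
import re

# <Tag>text</Tag>, where the closing tag must repeat the opening tag name.
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def parse_emotion_markup(text):
    """Return a list of (emotion, text) pairs, in order of appearance."""
    return [
        (m.group(1).lower(), m.group(2).strip())
        for m in TAG_RE.finditer(text)
    ]


segments = parse_emotion_markup(
    "<Sad>It was the saddest day of my life</sad>"
    "<Angry>yet the most hilarious</Angry>"
)
```

    Each segment would then be passed to the voice model together with its emotion label, assuming a model that accepts one.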


     
  9. AndersMalmgren

    AndersMalmgren

    Joined:
    Aug 31, 2014
    Posts:
    5,406
    Artists and taxi drivers, no one is safe from automation.
    Lucky for us devs that devs are needed for automation :p
     
    iamthwee and TonicMind like this.
  10. Billy4184

    Billy4184

    Joined:
    Jul 7, 2014
    Posts:
    5,383
    What would be good is if it could take an 'amateur' voice and polish it up to sound more professional, including the emotion and intonation.

    All these procedural generation type tools don't really excite me unless they have the ability to convert low-medium quality input to high quality output. That's the way to keep the signature of a game intact when you are generating things, and to be able to manage the fundamental characteristics of the result.
     
  11. Aiursrage2k

    Aiursrage2k

    Joined:
    Nov 1, 2009
    Posts:
    4,834
    They are making great strides in AI. I say give it a decade or so and we will be able to use these systems as input to create user content on the fly, personalized to the individual playing.

    ->Feel like playing a "horror game" set on a "titanic" in "space" where the monster chasing you is a "zombie cheerleader"? Then the system auto-generates the level and streams it down to you in seconds over 5G internet.
     
  12. neoshaman

    neoshaman

    Joined:
    Feb 11, 2011
    Posts:
    5,527
    It's not so easy, because the problem isn't AI, it's our understanding of the problem. You can't set up an AI to automate a problem you don't understand.

    Depends on what you mean by quality:
    1. sampling quality
    Well, that's the go-to NN stuff: super resolution. Creating the data to train a network is also super easy: you take high quality audio, then generate low quality versions of various types from it. The whole problem is figuring out the proper architecture, which means you will have to read all the papers on the matter. Also, data is generally indiscriminate, so architectures for images would work (and have worked) for audio.

    2. (some) acting (kinda) quality
    On top of what I said above, you can try to generate data: pick high quality samples, then have an amateur say the same thing without listening to the original sample. Amateurs being easier to find, you only need to source the high quality data (game audio dialogue dumps? audiobooks?). Then train a network by finding the proper architecture. That's style transfer, one of the early applications.

    Problem:
    Audio is a lot of raw data, more than images, and it is sensitive to temporal consistency. There is potential to accelerate training by using a low sample rate for style transfer training, then training another network to upscale the low sample rate output to higher quality.
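
    The "take high quality audio, then generate low quality from it" step can be sketched in a few lines. This is only the data-generation part, under simple assumptions: the degradation is naive decimation plus coarse quantization, a sine tone stands in for a real recording, and `make_training_pair` is a hypothetical name. No network is trained here.

```python
import math


def make_training_pair(clean, factor=4, bits=8):
    """Return (degraded, clean): the network input and its target.

    Degradation is naive decimation (keep every `factor`-th sample,
    with no anti-aliasing filter) followed by coarse quantization.
    """
    levels = 2 ** (bits - 1)
    degraded = [round(s * levels) / levels for s in clean[::factor]]
    return degraded, clean


# One second of a 440 Hz tone at 16 kHz stands in for a studio recording.
SAMPLE_RATE = 16_000
clean = [
    math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE)
]
degraded, target = make_training_pair(clean)
```

    Varying `factor` and `bits`, or adding noise, would give the "various types" of low quality input mentioned above.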
     
  13. Billy4184

    Billy4184

    Joined:
    Jul 7, 2014
    Posts:
    5,383
    That's the million dollar question with procedural generation. At the moment I see a lot of stuff that is designed to generate 'realistic' content, as if that is the real metric of quality. But what about content that emerges from stylistic rules? What rules? Are there even rules to what makes something charming? There must be, or everything or nothing would be charming.

    If you know the rules that govern beauty, you can turn a scene that is mediocre into something beautiful. You can take intent and guide it toward perfect realization.

    But instead it's usually just cloning realism, the thing we probably started playing games to escape anyway.

    That's why, unless you have no specific purpose to begin with (which I suppose is often the case), procedural generation is more of a hindrance than a help. Unless you designed it yourself of course, tailored to your purpose.
     
  14. iamthwee

    iamthwee

    Joined:
    Nov 27, 2015
    Posts:
    2,155
    I use google to book make and take appointments.

     
    Last edited: Nov 16, 2019
  15. iamthwee

    iamthwee

    Joined:
    Nov 27, 2015
    Posts:
    2,155
    Seems like this post was automated :D
     
    AndersMalmgren likes this.
  16. neoshaman

    neoshaman

    Joined:
    Feb 11, 2011
    Posts:
    5,527
    I dunno, it seems that you have narrow exposure to that stuff, because there is a hell of a lot of non-realistic stuff.

    HOWEVER we are mostly talking (implicitly) about neural networks in this thread, so I'll stick to that.

    Style transfer is about painterly style


    It can learn to generate stylized characters just fine (here anime characters, because of course weebs would jump on that) https://twitter.com/gwern/status/1095131651246575616?lang=eu

    There are neural networks that are able to classify images based on beauty (defined by the input dataset) or to find artistic innovation in painting (based on a data set and clustering).

    You can bake your own interpretation of sensitivity into a neural network by simply assigning scores to a data set.

    And don't forget you can chain neural networks together to produce a result.

    Most stuff is realistic because that's seen as an easy validation step due to the complexity of reality. If it can handle reality, what couldn't it handle?

    Anyway, there is also non-neural-network artistic PCG stuff.

    @Billy4184 In your case, you can probably show examples of good stuff and bad stuff, and teach the network to discriminate between the two. Now like I said, it's less a problem of the network and more one of discovering how solid your own definition is, as that can impact the quality of the learning, especially if there is hidden contextual data. The thing is, you can try.
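
    As a toy illustration of "show good and bad examples and learn to discriminate", here is a sketch that swaps the neural network for plain logistic regression on two made-up numeric features per asset. The features, data and function names are all hypothetical; a real setup would learn from the raw images or audio instead.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_scorer(examples, labels, epochs=2000, lr=1.0):
    """Fit logistic regression by batch gradient descent.

    Returns (weights, bias). Labels are 1 for 'good' and 0 for 'bad'.
    """
    dim, n = len(examples[0]), len(examples)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(examples, labels):
            p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(dim):
                grad_w[i] += (p - y) * x[i]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def score(model, x):
    """Estimated probability that x belongs to the 'good' set."""
    w, b = model
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hand-labelled examples: two invented features in [0, 1] per asset.
good = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.9)]
bad = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.2)]
model = train_scorer(good + bad, [1] * len(good) + [0] * len(bad))
```

    How well this works depends on how consistent the labels are, which is the "how solid is your own definition" point above.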
     
    Billy4184 likes this.