Text to Speech synthesis solution with license to use output in games

Discussion in 'General Discussion' started by Martin_H, Jun 25, 2017.

  1. Martin_H

    Martin_H

    Joined:
    Jul 11, 2015
    Posts:
    4,436
    I'm looking for an offline text-to-speech synthesis solution that can generate audio files from text, which I can then process in my DAW and use commercially in a game. One would imagine it's easy to find something like that, but my Google search indicates that commercial licensing for all the major solutions is very expensive (ongoing annual payments in the three-to-four-figure range), or the licensing doesn't permit the kind of use I want. Most solutions seem to be aimed at runtime TTS, but I explicitly want offline TTS, running only locally on my computer, with a license granting me the right to further process and commercially use the output. Many runtime TTS solutions seem to use the OS TTS APIs to generate the speech at runtime, which would still be covered by the license terms of the OS it's being run on, and as far as I could find out in my research, none of those allow commercial use of the generated audio.

    Can anyone tell me if the output of tools based on the open source MaryTTS can be used like I want to?
    http://mary.dfki.de/index.html
    https://github.com/marytts/marytts-txt2wav/tree/gradle
    https://github.com/marytts/marytts-txt2wav/tree/maven

    Or can you recommend other Open Source TTS software?

    If I can't find a free open source solution, then just hiring voice actors might be one of the cheaper options, but I'm concerned about the long-term availability of individuals. I'd much prefer to have something on my computer that I can keep using for all eternity to generate more audio.

    The final goal is a moderately processed female computer voice (bonus points if British), so the "computeryness" inherent to all TTS solutions wouldn't be much of an issue.

    Amazon Polly seems to be the latest and greatest new thing, but I don't understand whether the license terms allow the kind of offline re-use of the audio that I want, or how much that would cost me.
    https://aws.amazon.com/polly/

    A cloud service where I can buy TTS audio with full usage rights would theoretically be OK, but it still doesn't eliminate the concerns I have with regular voice-actor outsourcing. Services can be shut down too.

    Any suggestions?
     
  2. iamthwee

    iamthwee

    Joined:
    Nov 27, 2015
    Posts:
    2,149
  3. Ryiah

    Ryiah

    Joined:
    Oct 11, 2012
    Posts:
    20,954
    wstelzle and TonyLi like this.
  4. TonyLi

    TonyLi

    Joined:
    Apr 10, 2012
    Posts:
    12,670
    ^ I was just about to post a link to RT-Voice. If you happen to have the Dialogue System, its RT-Voice support package includes a utility that processes all of your dialogue through RT-Voice and creates audio files for each line.

    Your speech synthesis quality in RT-Voice will depend on which voice(s) you're using. With the right voices, you can get pretty good results.
     
    Last edited: Jun 25, 2017
  5. Martin_H

    Martin_H

    Joined:
    Jul 11, 2015
    Posts:
    4,436
    If possible I'd like to avoid it, because those people might stop offering their services any day (I don't see Fiverr offers as proper businesses, because at their rates that can't be sustainable for many of them), and what if I then want to add some more voiceover in the same voice to the game? That's the point where you either throw consistency out the window or redo ALL voiceover for that character with another voiceover artist. Even if they are still around a year later, their recording setup might have changed and sound totally different. Audio stuff is tricky. I'd prefer a stable TTS pipeline that still gives me reliable results years later.

    Yeah, I've seen that. But the way I understood it, it actually embeds the whole MaryTTS library to do that at runtime, or uses the OS's TTS API. I explicitly do not want runtime TTS; I want it offline, so that I can process the files in my DAW to get the sound that I want.

    By the way, searching on Google I already found this very thread on page 1 for one of my searches x].


    If you mean non-open-source voices, like the ones that come with the OS or with cloud-based online services, it is likely you aren't allowed to do what you just described.



    As far as I can tell, the Amazon Polly license permits very broad use of the output, but I have only found vague bits of FAQ and have not read the actual license (which I'd probably have trouble understanding anyway).
     
  6. iamthwee

    iamthwee

    Joined:
    Nov 27, 2015
    Posts:
    2,149
    Yeah, but the speech-generated ones, even the Amazon-based ones, seemed way too robotic to be realistic. I'm not sure what you're trying to do, though.

    The best solution would be to find a female family member or friend who lives in the UK, IMO.
     
  7. TonyLi

    TonyLi

    Joined:
    Apr 10, 2012
    Posts:
    12,670
    No, in addition to generating audio at runtime, RT-Voice can actually generate plain old audio files in .OGG (Ogg Vorbis) format at design time, without embedding anything else in them. I'm pretty sure it's legal to use these files on their own at that point, although specific licensing may apply to certain voices. The developer, @Stefan-Laubenberger, may have a better answer.
     
    Martin_H likes this.
  8. Martin_H

    Martin_H

    Joined:
    Jul 11, 2015
    Posts:
    4,436
    As I wrote in the first post, a computer voice is exactly what I want. I wouldn't try to replace regular human voices with TTS; that would be ridiculous.

    Since I'm from Germany, I can guarantee I don't have any. Otherwise not a bad idea, though.
     
  9. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,554
    It is funny, because I was thinking about something similar a few days ago.

    As far as I can tell, the main options are eSpeak (horrible, robotic and abandoned, but it can be configured to use MBROLA voices), MaryTTS, and there were mentions of some software called Merlin. Surprisingly, this is all that exists out there.

    Commercial solutions seem to be hell-bent on roping you into their cloud services, meaning that for a game they're a no-go.

    In your situation I'd try to tinker with MaryTTS and see if it can be used with your own set of voice files. Then I'd look for a relatively affordable voice actor/actress to record sample lines to be processed by the TTS solution.
     
    Martin_H likes this.
  10. iamthwee

    iamthwee

    Joined:
    Nov 27, 2015
    Posts:
    2,149
    Martin_H likes this.
  11. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,554
    The problem here is that it is research, not a product you can use. That's the primary issue.

    Also, there's the matter of the computing power required to utilize it.

    Basically... don't get excited over any bleeding-edge research, ever. Usually it's not the kind of thing that you can use immediately.
     
    Martin_H likes this.
  12. iamthwee

    iamthwee

    Joined:
    Nov 27, 2015
    Posts:
    2,149
    Yeah, I couldn't find anything remotely usable for TTS... Although off-topic, the idea that bleeding-edge tech is unusable and needs massive computing power isn't necessarily true.

    A lot of TensorFlow comes with Google doing all the hard work; all you do is download their pretrained models. For example, you download Inception for image recognition and you're good to go for image classification (tried and tested). It takes a few seconds to classify any animal picture I give it on my Mac mini.

    ATM I'm playing around with pre-generated art styles.

    https://github.com/anishathalye/neural-style

    It only takes 5 minutes to spit out an image if you're using a GPU.
     
  13. Martin_H

    Martin_H

    Joined:
    Jul 11, 2015
    Posts:
    4,436
    Interesting idea, but sounds like a whole lot of extra work.

    I've found an official answer to the licensing question. It seems like Amazon Polly is the way to go so far.
    https://forums.aws.amazon.com/message.jspa?messageID=772409#772409
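    For readers who want to try the "generate files once, then process them in a DAW" workflow with Amazon Polly, here is a minimal sketch rather than anything from the thread itself. It assumes the boto3 Polly client and its `synthesize_speech` call; the function name `synthesize_lines`, the line IDs, the output folder and the choice of the British English voice "Amy" are illustrative assumptions. Check the licensing terms yourself before shipping any of the audio.

```python
from pathlib import Path


def synthesize_lines(client, lines, out_dir, voice_id="Amy"):
    """Write one Ogg Vorbis file per dialogue line and return the paths.

    `client` is expected to behave like boto3.client("polly");
    `lines` maps a (hypothetical) line ID to the text to speak.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, text in lines.items():
        response = client.synthesize_speech(
            Text=text,
            OutputFormat="ogg_vorbis",
            VoiceId=voice_id,
        )
        target = out_path / f"{name}.ogg"
        # AudioStream is a file-like object holding the encoded audio.
        target.write_bytes(response["AudioStream"].read())
        written.append(target)
    return written


# Usage (needs boto3 installed and AWS credentials configured):
#   import boto3
#   synthesize_lines(boto3.client("polly"),
#                    {"shields_low": "Warning. Shields at ten percent."},
#                    "voice_out")
```

    Passing the client in as a parameter keeps the function testable without network access, and the generated files stay on disk even if the service later changes or shuts down.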

    I would theoretically have use cases for that, but not if it takes 5 minutes at screen resolution.
     
  14. iamthwee

    iamthwee

    Joined:
    Nov 27, 2015
    Posts:
    2,149
    @Martin_H the Amazon fees look reasonable, I guess, for a set number of words. It's the most realistic so far.

    [offtopic] 5 minutes isn't that bad. Of course, if you throw more GPUs at it, it gets faster, but not many have multiple Titans to spare LOL. Deepart.io is a good place to look, but most of the work is from Google's TensorFlow. Personally I find the image classification useful. I'm going to explore TensorFlow a bit more this week. [/offtopic]
     
    Last edited: Jun 25, 2017
  15. neoshaman

    neoshaman

    Joined:
    Feb 11, 2011
    Posts:
    6,492
    Shhh, don't give them ideas, just sell an implementation on the Asset Store. Bleeding edge means it's inaccessible, we have already argued about that; it's not worth seeing whether it's practical or making it practical, even though bleeding-edge shaders, which are sometimes more complex, get a pass, because. :p
     
  16. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,554
    The pricing actually looks good, but I'd double-check that they allow use of the output in video games.

    If I were making some sort of "ship voice TTS", I would try to use one of the open source projects with custom voice data.

    However, based on a cursory Google search, it looks like the custom voice creation process for MaryTTS is pretty much the opposite of user-friendly (just take a look at this, for example: http://www.dfki.de/pipermail/mary-users/2015-November/001784.html ). So Amazon Polly may be a safer option after all.
     
    Martin_H likes this.
  17. Martin_H

    Martin_H

    Joined:
    Jul 11, 2015
    Posts:
    4,436
    They are ridiculously cheap compared to http://www.speech2go.online/pricing where for 4 Euro you would get 600 characters instead of 1 million. Those two should be very similar voice tech, because I think that's from the company that Amazon bought.

    I'll try, but based on the thread I've linked, it seems like you need to pay for premium support if you want any "more official" answer than what you can see in the thread, which was already that user's second attempt at such a thread. But if they allow you to resell the audio to your clients in freelance projects, I'm fairly confident that games are going to be OK, because I'd consider resell/sublicense rights to be on par with, or one step above, the level of rights that I need.

    5 minutes at that 500x500 resolution is super bad for what I need. I rarely work on things below 3k x 3k when painting, and I have some files in the 7k x 7k and upwards range. Their highest paid tier, which takes a day to process, is damn expensive and doesn't even support anything higher than 3k res. At those speeds I can likely paint the high-res images quicker by hand. I'd only ever use this as an intermediate step: transferring something like a 3D mockup, which I rendered to get perspective and cast shadows right, into something that looks like I've painted it, and then refining it further by painting over it manually. I need something crazy fast for that, so that I have rapid iteration times at high resolutions.

    What it could be interesting for is using the low-res images, created from a combination of abstract 3D renderings and macro photography of things like meat, as inspiration for surrealist paintings, and then painting them on canvas with oils or acrylics.
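    To put the price gap above into numbers, here is a rough back-of-the-envelope sketch. It uses only the figures from this post (600 characters versus 1 million characters for about 4 units of money) and assumes, for simplicity, that both prices are in the same currency; the function name is made up for illustration.

```python
def chars_per_unit(characters: int, price: float) -> float:
    """Characters of synthesized speech bought per unit of currency."""
    return characters / price


# Figures quoted in the post; treating both prices as the same
# currency is a simplifying assumption.
speech2go = chars_per_unit(600, 4.0)
polly = chars_per_unit(1_000_000, 4.0)

# How many times more text the same money buys with the cheaper service.
ratio = polly / speech2go
print(f"About {ratio:.0f}x more characters for the same money")
```

    Under those assumptions the same spend buys well over a thousand times more text, which is why per-character pricing matters little for a game with a modest amount of computer-voice dialogue.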
     
  18. iamthwee

    iamthwee

    Joined:
    Nov 27, 2015
    Posts:
    2,149
  19. Martin_H

    Martin_H

    Joined:
    Jul 11, 2015
    Posts:
    4,436