Discussion in 'ML-Agents' started by EternalMe, Jun 18, 2022.
still hoping to see something
I just found
ML-Agents Release 20
Will you now be updating it more regularly? How's the outlook?
I'm very excited to see that ML-Agents is still alive! I'm going to look into updating my ML-Agents Airplanes course to the new version. Actually, a slower update cadence than before would (in my opinion) be welcome. It was nearly impossible to keep educational content up to date with breaking changes every month. As a result, I think a lot of new users got overwhelmed trying to learn how to use a rapidly changing product (that was already complicated).
Please let us know of your schedule updating the courses whenever you can! Your material's extremely useful
ML-Agents went from dead mode into low-energy mode:
- No help from devs in the forum
- Issues on GitHub are mostly unhandled
- The new release felt very tiny, and so did everything after it
Yeah, definitely need help from devs in forum!
So, pretty much dead. Our fears have been largely confirmed... It may be high time for a community-led effort to resuscitate it, though I'm not 100% sure how that would work; otherwise there's not going to be any serious additional development.
@TV4Fun You can have a look at Peaceful Pie (full disclosure: I created it)
" a bunch of new features in Q4 of this year, sometime in October/November. Stay tuned!"
Any word on the features?
With releases like SciSharp .NET stack and Tensorflow.NET why do we still have to interop with python? C# can be up to 20X faster than python.
> C# can be up to 20X faster than python.
I suspect c# can be a loootttt faster than that.
I benchmarked Lua vs Python a while back. Lua is another scripting language, like Python. My benchmark showed that Lua was 50 times faster than Python.
That said, for machine learning, the speed of Python doesn't really change much, because you just take [some giant tensor] and pass it into a C++-backed op that returns [another giant tensor]. Any "tight inner loops" are all inside that C++-backed op.
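As a toy illustration of that point (using Python's C-implemented built-in `sum()` as a stand-in for a C++-backed tensor op; timings are machine-dependent):

```python
import time

# Stand-in for "pass the whole tensor into one C-backed op":
# Python's built-in sum() runs its inner loop in C, while the
# hand-written loop below runs every iteration in the interpreter.
data = list(range(1_000_000))

t0 = time.perf_counter()
total_loop = 0
for x in data:              # tight inner loop in pure Python
    total_loop += x
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
total_c = sum(data)         # same reduction, inner loop in C
t_c = time.perf_counter() - t0

assert total_loop == total_c == 499_999_500_000
print(f"pure-Python loop: {t_loop:.4f}s, C-backed sum(): {t_c:.4f}s")
```

The same shape applies to PyTorch: the Python code only orchestrates whole-tensor calls, so the interpreter overhead is amortized over millions of elements per call.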
But I do agree that keeping stuff in one language (e.g. C#) is going to make one's program a zillion times easier to maintain. And the interop is not typically going to be super fast.
Main reason to use Python is probably if one is *already* using Python. A lot of machine learning researchers for example have our IP already in Python. Unity is just-another-environment we can use.
I certainly get it and agree with you on the basics. Thing is that Unity beats out the Jupyter notebooks, and even Wolfram is bass-ackwards, running Unity in some kind of Docker with it. Unity should be a native environment for AI/ML with no interop BS. You can't make a proper enterprise product with the current spaghetti I/O that is required.
Honestly, coming from Windows, Python feels like fiddling with batch files from 1992. It took half a day of hard work to complete the installation, because the newest Python version doesn't support PyTorch (one of the 999 packages you have to install by hand, whatever it does) anymore. I was lucky to find a comment on a YouTube video to understand the problem. If you are used to one-click installers that install everything they need by themselves, you feel like you're back in the stone age.
Will drive many people away from playing around, imho. You can say what you want, but installing/adding stuff was a breeze in Unity... It just worked. I mean, before the package manager. ;-)
EDIT: the preferred way would be to download a package from the Asset Store that installs everything needed, then work only in C# inside Unity, without having to care about any interface issues.
> You can't make a proper enterprise product with the current spaghetti I/O that is required.
Do you mean, you want to use AI in your game, to learn behaviors during game-play?
I agree that currently, mlagents only really targets running inference on end-user devices, not learning itself.
I believe that Unity's position on this is that reinforcement learning takes so long, needs so many training examples, that learning from human feedback is not very practical. Though actually since ChatGPT uses human in the loop feedback for training, https://arxiv.org/abs/2203.02155 , such arguments might change at some point. But note that ChatGPT is not really learning in real time: they take human feedback, use that to train a proxy model, and then do their giant i-need-a-nuclear-power-plant-to-run-this chatgpt training bit.
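A toy sketch of that proxy-model idea: human rankings of pairs of outputs are used to fit a scorer, which then stands in for the human during training. This is a Bradley-Terry-style preference model; the features and update rule here are simplified stand-ins, not the actual InstructGPT recipe.

```python
import math

def train_reward_model(preferences, lr=0.5, epochs=200):
    """preferences: list of (features_preferred, features_rejected) pairs.
    Fits a linear scorer so preferred outputs score higher."""
    dim = len(preferences[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in preferences:
            s_b = sum(wi * x for wi, x in zip(w, better))
            s_w = sum(wi * x for wi, x in zip(w, worse))
            # Bradley-Terry style: push P(better beats worse) toward 1
            p = 1.0 / (1.0 + math.exp(-(s_b - s_w)))
            g = 1.0 - p
            w = [wi + lr * g * (b - c) for wi, b, c in zip(w, better, worse)]
    return w

# Hypothetical data: humans preferred outputs with a higher first feature.
prefs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.2, 0.7])]
w = train_reward_model(prefs)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
assert score([1.0, 0.2]) > score([0.1, 0.9])  # proxy agrees with the humans
```

The RL agent then trains against `score()` instead of asking a human for every rollout, which is what makes the approach tractable.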
What is your approximate use-case that doesn't fit into mlagents current framework? (obviously keeping the description general enough to avoid revealing your actual target product).
> PyTorch (one of the 999 packages you have to install by hand, whatever it does)
PyTorch is the neural-network framework that does the actual learning: https://pytorch.org/
You see the problem. I'm sure I'm not the only idiot here.
I am in the NLP/speech synthesis/inference and intent engine field currently. I would like to be able to build and train a neural-net TTS in a Unity runtime app, as an example. I don't care that it takes time to train the NN. I'd set it up so you can watch graphs and epochs unfold and change as the training proceeds. I will watch a disk defragger do its thing with no issues, as long as I can see a progress bar or a number change, so I know I am making progress on an ongoing op, how much is done, and what is left to do.
> You see the problem. I'm sure I'm not the only idiot here.
Oh yeah, fair. I'm not actually a game developer, I work in AI Research. So I use pytorch every day at work, so, different backgrounds and knowledge.
I do agree that mlagents seems radically different from other parts of Unity that "just work". Like, by default, lighting and so on simply work, and you can modify stuff in the interface. Having to open a command prompt and run ml-agents seems to diverge from that experience to me. It feels to me more like an experiment, than something that is part of their core product.
To be honest, I'm not sure exactly who the target audience is for mlagents. For researchers, it doesn't give us enough power or control (which is why I wrote PeacefulPie). For game developers, the having-to-use-a-terminal bit feels a little outside of the normal smooth Unity workflow, as you allude to.
But another thing as far as games development goes: I'm really not sure how RL fits into game development, to be honest? I feel that most game AIs will work really well using heuristics and stuff, sort of the Code Bullet approach to AI. I'm not sure that RL is the best way to make fun, interesting enemies and so on in games? I feel like RL is more of a research-y thing in the first place?
It can be accessed with https://github.com/SciSharp/Torch.NET which is part of this stack. https://scisharp.github.io/SciSharp/
So I tried a few days back to get these libraries into Unity, but ran into a snag: one of the libraries uses a file type with a .meta suffix, which caused a console full of red errors screaming for GUIDs, since these files did not conform to Unity's .meta format.
> I am in the NLP/speech synthesis/inference and intent engine field currently. I would like to be able to build and train a neural-net TTS in a Unity runtime app, as an example. I don't care that it takes time to train the NN. I'd set it up so you can watch graphs and epochs unfold and change as the training proceeds. I will watch a disk defragger do its thing with no issues, as long as I can see a progress bar or a number change, so I know I am making progress on an ongoing op, how much is done, and what is left to do.
I feel like text to speech (and speech to text) are supervised learning problems, that need tons of data, and for which off-the-shelf solutions already exist?
Have you considered using e.g. OpenAI Whisper for speech to text, https://openai.com/blog/whisper/ (note: they appear to only provide python examples, but I imagine they must have an http API somewhere, that you can simply call from C#?)
Similarly, for text to speech, have you considered using e.g. Google text to speech? https://cloud.google.com/text-to-speech
(Edit: I'm probably misunderstanding your requirements here?)
Yeah. This is my issue with ML-Agents: not actually useful outside trivial game uses that I would solve with some conditional loops and a weighting system, like I have been doing with my version of martial-arts procedural IK rig training. But I am more interested in inference and intent engines, orthographically generated TTS synthesis, Neural Radiance Fields and custom training interfaces. I started making my own setup with Unity components last week but got sidetracked after beginning it.
> So I tried a few days back to get these libraries into Unity, but ran into a snag: one of the libraries uses a file type with a .meta suffix, which caused a console full of red errors screaming for GUIDs, since these files did not conform to Unity's .meta format.
Hmmm, that sounds challenging. Maybe you can build a C# plugin DLL that you drag and drop into the Unity project?
I have been thru every TTS repo. The best quality and most lightweight is Larynx. If I sell a product for embedded systems to an automobile OEM they are not going to want central server voices nor a bunch of junk running on their HMI automobile internal networks. I do not do games. I solve enterprise issues and am more interested in HUDs for jets and autos, remote natural human interfacing hands free comms for field repairs, smart assistants for engineering, scientific and construction projects and similar.
I have all the major corpus data and even created a lightweight set of training corpus based on IPA alphabet phonemes. I have no interest in promoting google, azure or other big tech monopolizing hive minds with any of my work. I am sure millions of others feel the same. Be nice as well that my application would work properly without a network connection. Some applications are forbidden to access anything outside their sandbox.
I need a bit more knowledge. I find proper scraps here and there. Most of it has been the same old diagram of the cross-linked neural nets, and they never speak to what exactly is occurring under the hood. I read some stuff from the fellow who started that repo, and I get it as far as the architecture and what each layer of my neural net might look like for training or cloning a voice. I have been pretty good at making tools for Unity devs over the years, and I have a fairly decent idea for multi-purpose component sets you could drag and drop: check some bools for operation types, weight and bias graphings, set the next layer's neurons, ground-truth comparators and configurable tensors.
So here is what a Unity component stack might look like... feel free to add/subtract/criticize.
NNConfigurator - Top-level component that controls the number of neurons in each layer and the connected neurons from layer to layer.
NNNeuron - The data type to be operated on: string, number, audio, image, video, gameObject(s), transform(s).
NNOperator - Used by NNNeuron to control the operations performed on the data type. Has all the standard ops that NumPy/PyTorch has.
NNComparator - Compares the output of the NNNeuron transformation to a ground truth and returns a value to be used in weighting and biasing, with each more successful transformation overwriting that part of the NNTensor.
NNTensor - Holds the transformations of the operators, derived from comparing the results of an operation to the ground truth.
NNEpochController - Controls the recursion through the NNNeuron layers and graphs the ongoing weight and bias of the NNTensor versus ground truth. Bails out, with the solution in NNTensor, when training has reached a user-defined threshold or number of epochs, or when ground truth has been achieved across multiple samplings.
NNRuntimeGraph - Uses the NNTensor to perform transformations on objects in accordance with its training.
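To make the division of labor concrete, here is a minimal pure-Python sketch of how a couple of those components might interact. The class names mirror the list above, but everything inside them is my own guess, and the "network" is reduced to a single linear neuron learning y = 2x + 1.

```python
class NNComparator:
    """Compares an output to ground truth; 0.0 means an exact match."""
    @staticmethod
    def loss(output, truth):
        return (output - truth) ** 2

class NNEpochController:
    """Runs epochs until a loss threshold or an epoch limit is hit."""
    def __init__(self, max_epochs=5000, threshold=1e-6):
        self.max_epochs = max_epochs
        self.threshold = threshold

    def train(self, weight, bias, samples, lr=0.01):
        for _ in range(self.max_epochs):
            total = 0.0
            for x, truth in samples:
                out = weight * x + bias        # NNNeuron forward op
                err = out - truth
                total += NNComparator.loss(out, truth)
                weight -= lr * err * x         # adjust the "NNTensor" values
                bias -= lr * err
            if total < self.threshold:         # bail out near ground truth
                break
        return weight, bias

# Learn y = 2x + 1 from a handful of (input, ground-truth) samples.
samples = [(x, 2 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b = NNEpochController().train(0.0, 0.0, samples)
print(round(w, 2), round(b, 2))  # learned w is close to 2, b close to 1
```

A real version would need multi-layer graphs, batched tensors, and backpropagation through arbitrary ops, which is exactly the machinery PyTorch provides; that gap is what makes a drag-and-drop component stack harder than it first looks.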
Oh... I actually disagree with the premise that it's possible to design neural network architectures through a relatively simple graphical interface. I think that making such a thing available would be making a similar mistake to mlagents in the first place, i.e. targeting a level of abstraction that doesn't quite mesh with what users actually need. I've seen people try to build frameworks that let people assemble networks from config files and such, e.g. https://github.com/asappresearch/flambe . I won't say such things don't work at all, but they tend to work within limited niches, and are not very configurable except for a few specific tasks/problems, in my opinion.
I feel that for Unity to make AI available to game developers, they either need to go much higher level, e.g. give us access to speech-to-text APIs or text-to-speech APIs, or go much lower level, and give us access to C# or Python. Given access to C# or Python, we can use existing AI frameworks, such as the ones you pointed out, to create our own AIs, with full flexibility and control over the framework.
And Unity already gives us access to C#. So, really, I feel that Unity don't need to do anything at all.
As far as speech-to-text APIs and text-to-speech APIs go, note that such things are non-trivial to build. They need massive amounts of data, and typically specialized models and expertise. I do not think that Unity should build their own models or services: they should simply proxy such APIs through to OpenAI Whisper, or Google text-to-speech, or similar. But running such services is not free; it's very expensive. The models are huge, and you need powerful GPUs to run them. These services cannot run for free, and Unity would need to pass that cost on to us. That's probably a huge can of worms (people would likely complain about having to pay), and maybe that's why they don't do it.
An example of how the above may be used for training a TTS system with a voice using IPA pronunciations.
1. Lists of words are classified into groups where one particular IPA phoneme is used in each word, but in various combinations with other IPA phonemes to form various diphthongs. The user speaks each word as it flashes on the screen and is recorded, labeled with the word, and stored as an AudioClip for referencing as groundTruth during the NNTensor adjustments.
2. The AudioClip is analyzed by FFT/Mel spectrogram, rmsDecibels, coarse and fine transient peaks, and root pitch frequency.
3. The FFT is transformed into an MFCC (a cosine operation), which allows extraction of the AudioClip's features and can identify the various phonemes in a word, given that it already has the labeling of the clip itself and Wiktionary to extract the order of the IPA phonemes.
4. As it extracts the IPA phonemes, it adds the cut-up audio into phoneme banks, which then get averaged per phoneme to create a bank of archetypes. There are 53 of these for sounding out most any word in any language.
*More can be extracted, such as pitch over a whole word, or reading various sentences with surprise, anger, sadness, gruffness, melodiousness etc., to extract pitch curves for adding emotional lyricality and beat frequency/speaker rhythm to generated speech.
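Step 4 above (averaging the cut-up clips into per-phoneme archetypes) can be sketched as follows. Here a "clip" is just a fixed-length list of floats standing in for Mel-spectrogram features, and the labels and values are made up for illustration; real clips would first need alignment or resampling to a common length before element-wise averaging makes sense.

```python
from collections import defaultdict

def build_archetypes(labeled_clips):
    """Pool cut-up phoneme clips into banks, then average each bank
    element-wise into a single archetype per phoneme."""
    banks = defaultdict(list)
    for phoneme, clip in labeled_clips:
        banks[phoneme].append(clip)
    archetypes = {}
    for phoneme, clips in banks.items():
        n = len(clips)
        # assumes every clip in a bank has the same length
        archetypes[phoneme] = [sum(vals) / n for vals in zip(*clips)]
    return archetypes

# Hypothetical feature vectors; labels are IPA symbols.
clips = [
    ("æ", [0.2, 0.4, 0.6]),   # /æ/ cut from one recorded word
    ("æ", [0.4, 0.6, 0.8]),   # /æ/ cut from another
    ("ʃ", [0.9, 0.7, 0.5]),
]
archetypes = build_archetypes(clips)
assert abs(archetypes["æ"][0] - 0.3) < 1e-9  # (0.2 + 0.4) / 2
```

The 53-archetype bank described above would just be this dictionary with one entry per IPA phoneme.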
The NNTensor would then hold the transformations per phoneme archetype depending on the surrounding combinations of IPA symbols.
1. Go back through the list of words and reconstruct each word from the phoneme archetypes. In the NNNeuron it would be an array of the IPA phoneme archetypes that make up the word. Each NNNeuron would shuffle, abut, fade in and out, and move the pitch up or down. This would basically be a back-and-forth x operation and an up-and-down/pitch y operation, with values for fade transitions.
2. For each series of transformations on the archetype-reconstructed word, compare its graphs against the groundTruth analysis graphs and weight them by difference, with 0 being an exact match between ground truth and the orthographically concatenated synthesis.
3. Each NNNeuron in a layer completes its operations, and a ratio of the lowest-weighted configs gets passed to the next layer of NNNeurons, which take the values and adjust them, with their closest-to-zero configs getting passed on and adjusting the NNTensor values at each step.
After training, the NNTensor has numbers at known indexes in various arrays such that, given a text IPA input, it can take the Mel spectrogram FFTs, concatenate them in the time domain per rules defined in the NNTensor, and perform an inverse FFT to derive an AudioClip.
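The "weight by difference, with 0 being an exact match" scoring in step 2 might look like this in miniature. The "graphs" here are plain lists of floats; the real features would be the Mel spectrogram, pitch curve, transient peaks, and so on.

```python
def weight(reconstructed, ground_truth):
    """Mean absolute difference between two equal-length analysis graphs.
    0.0 means the reconstruction exactly matches the ground truth."""
    assert len(reconstructed) == len(ground_truth)
    n = len(ground_truth)
    return sum(abs(r - g) for r, g in zip(reconstructed, ground_truth)) / n

truth = [0.1, 0.5, 0.9, 0.5]   # hypothetical ground-truth analysis graph
exact = [0.1, 0.5, 0.9, 0.5]   # perfect reconstruction
close = [0.1, 0.6, 0.8, 0.5]   # small errors
far   = [0.9, 0.1, 0.1, 0.9]   # large errors

assert weight(exact, truth) == 0.0                 # exact match scores 0
assert weight(close, truth) < weight(far, truth)   # lower weight = closer
```

The layer-to-layer selection in step 3 would then just keep the configs whose `weight()` is closest to zero and pass them forward.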
The advantage is that it is all inside Unity: you can probe it at any point with any C#, graph every last thing, or have it run as an invisible process and play with the inputs and outputs, clone voices, take characteristics of one voice and apply them to another... etc.
I have used Barracuda to create a remote gymnastics teaching app for an Olympics coach, doing pose detection against a "perfected pose" to score students' posture and motion. For intent and inference, I have been working with the original patented Markov chain/RNN inference engine that Google and the rest cite in their patents. It beat Google, Alexa, Watson and Siri in a contest judged by a panel of NLP experts. It is portable, weighs in at 4 MB and is in C# .NET; the brain files the Studio exports work just dandy in Unity, and prompts or responses can execute command and control. I have it working dandy with Vosk open-source voice recognition and Larynx TTS on a server, with various portable native TTS solutions for Mac, PC, Android, iOS. We have wrapped Vosk as part of our Core.dll plugin for Unity.
For more exotic mathematics there is https://www.mathdotnet.com/
From a GameDev perspective, I think anything that can help with content creation is a good fit for ML. At least for me, lack of content is usually the showstopper for any game. And I would gladly pay for telling an art lib: "create a dungeon with a central room of 100 m² on two levels; the upper level should have a connection to the lower level by stair05 and a railing. The room is connected to 2 storage rooms of between 10 m² and 50 m², filled with props. The connections should be tunnel01 tiles and should include one level of stairs03 or stair01."
Alternatively, any visual tool would be welcome, too.
I mean, just try dragging around walls and floors in the Unity Editor to understand what I mean. So automated content creation is king.
> An example of how the above may be used for training a TTS system with a voice using IPA pronunciations.
I got most of the IPA language sieve data into ScriptableObjects so they can be used for other language models. I have every corpus used for training the large language models; I don't think they have to use them. When I stumbled across Larynx, they seem to have used the same minimalist Mel-spectrogram approach and used the GlowTTS vocoder. It is fast, unlike Coqui, which we had running in Python using interops and command-line prompts; Larynx in the same environment was 10-25x quicker and the voices much better. New voices can be trained with as little as 30 minutes of audio. It even supports SSML properly, and a viseme/phoneme timing stream, which none of the big boys support properly. But I have to run it off a server via REST API. I would like to do something like that, but native in Unity, with a component-based framework that can be extended. Maybe I should start a GitHub repo if there is interest.