
Audio Sensor [Code & examples added]

Discussion in 'ML-Agents' started by mbaske, Dec 10, 2020.

  1. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Hi, I'm wondering if it would be possible to build an audio sensor component, given the current framework. As far as I know, audio can be represented and processed as 2D data (time x amplitude) in machine learning. Therefore I thought one might be able to implement such a sensor using the existing visual/grid observation pipeline?
    A 24 kHz stereo signal, for instance, could be chunked into 1-second slices of 240x200 (48,000) float values. Or the audio could be preprocessed to reduce chunk sizes, e.g. by applying an FFT first and cutting off some low and high bands that aren't critical for observations.
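    To make the FFT option concrete, here's roughly what I'm picturing (just a sketch; the FFT size and band cutoffs are arbitrary):

    Code (CSharp):
    // Sketch: grab the current FFT frame from the master mix and keep only the
    // mid bands, so each observation chunk stays reasonably small.
    using UnityEngine;

    public class SpectrumChunker : MonoBehaviour
    {
        const int FftSize = 1024;   // GetSpectrumData requires a power of two
        const int LowCut = 8;       // drop the lowest bins
        const int HighCut = 512;    // drop the upper half of the bins

        readonly float[] _spectrum = new float[FftSize];

        // Returns one trimmed spectrum frame; a sensor could stack these into a 2D chunk.
        public float[] GetTrimmedSpectrum(int channel = 0)
        {
            AudioListener.GetSpectrumData(_spectrum, channel, FFTWindow.BlackmanHarris);
            var trimmed = new float[HighCut - LowCut];
            System.Array.Copy(_spectrum, LowCut, trimmed, 0, trimmed.Length);
            return trimmed;
        }
    }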
    However, I have no idea how the agent would make sense of signals starting and ending at arbitrary points in time. Unlike with visual observations, a data chunk of fixed size can look very different depending on when a sound occurs, even if it's always the same kind of sound. A possible workaround might be a gate/trigger mechanism that only generates observations once the signal goes above some threshold. But this still wouldn't account for signals of varying lengths or for different overlapping sounds. How do audio-processing ML projects usually solve this? Are they using LSTMs, or maybe observation stacking? How realistic is something like this in a reinforcement learning context, where we don't have any given output classes to train for? Thanks!
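    Edit: To illustrate the gate idea, something like this (rough sketch; GetOutputData only returns the latest window of samples, so a real implementation would probably capture audio via OnAudioFilterRead instead):

    Code (CSharp):
    // Rough sketch of the gate/trigger workaround: only start filling the
    // observation buffer once the signal rises above a threshold.
    using UnityEngine;

    public class AudioGate : MonoBehaviour
    {
        public float Threshold = 0.01f;   // amplitude above which we treat the signal as sound
        const int ChunkSize = 24000;      // 1 second at 24 kHz, mono for simplicity

        readonly float[] _frame = new float[1024];
        readonly float[] _chunk = new float[ChunkSize];
        int _writeIndex = -1;             // -1 = gate closed

        public bool ChunkReady { get; private set; }

        void Update()
        {
            // Polls the most recent output window; samples between frames can be missed.
            AudioListener.GetOutputData(_frame, 0);

            for (int i = 0; i < _frame.Length && !ChunkReady; i++)
            {
                if (_writeIndex < 0 && Mathf.Abs(_frame[i]) >= Threshold)
                    _writeIndex = 0;                   // open the gate

                if (_writeIndex >= 0)
                {
                    _chunk[_writeIndex++] = _frame[i];
                    if (_writeIndex == ChunkSize)
                        ChunkReady = true;             // hand _chunk to the sensor, then reset
                }
            }
        }
    }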
     
  2. awjuliani

    awjuliani

    Unity Technologies

    Joined:
    Mar 1, 2017
    Posts:
    69
    Hi mbaske,

    This is a very interesting potential project. To be honest, I have not heard of anyone doing RL on an audio signal like this, but I have heard of more general ML projects doing such things. I'd be curious to know more about any task you might have in mind. I think using a RenderTexture over the FFT of the audio, for example, could indeed be a usable representation. A few years ago I did something similar for a class project to classify audio signals, so I know it is possible.
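    Something along these lines might be a starting point (just a sketch; the texture size and amplitude scaling are arbitrary, and the RenderTexture would be assigned to a RenderTextureSensorComponent):

    Code (CSharp):
    // Sketch: draw successive FFT frames into a texture so the existing visual
    // observation path (e.g. a RenderTextureSensorComponent) can consume them.
    using UnityEngine;

    public class SpectrumTexture : MonoBehaviour
    {
        public RenderTexture Target;    // the render texture the sensor observes
        const int Width = 256;          // one column per FFT bin
        const int Height = 64;          // rolling history of frames

        readonly float[] _spectrum = new float[Width];
        Texture2D _tex;
        int _row;

        void Start()
        {
            _tex = new Texture2D(Width, Height, TextureFormat.RGB24, false);
        }

        void Update()
        {
            AudioListener.GetSpectrumData(_spectrum, 0, FFTWindow.Hanning);

            // Write the newest frame into the next row, wrapping around.
            for (int x = 0; x < Width; x++)
            {
                float v = Mathf.Clamp01(_spectrum[x] * 10f);   // crude scaling for visibility
                _tex.SetPixel(x, _row, new Color(v, v, v));
            }
            _row = (_row + 1) % Height;

            _tex.Apply();
            Graphics.Blit(_tex, Target);    // copy into the RenderTexture the sensor reads
        }
    }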
     
  3. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Thanks for your feedback @awjuliani! I've been experimenting with this a bit; so far I've noticed a couple of things to consider:
    1) Audio runs on its own separate thread, unaffected by the global time scale.
    2) There can only ever be one AudioListener in a scene. Although there are workarounds like virtual listeners, I'm not sure they can provide access to their specific FFT data.
    This would limit training & inference to one agent per executable, running in real time.

    3) Since audio is continuous, the observation size depends on the delta time between observations. The sensor would have to know the decision interval up front in order to provide the correct shape. Alternatively, the sensor could trigger decisions once a given number of samples has accumulated.
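    Here's roughly how that second option could look (just a sketch; the chunk size is arbitrary, and the component needs an AudioSource or the AudioListener on the same GameObject so OnAudioFilterRead gets called):

    Code (CSharp):
    // Sketch: accumulate samples on the audio thread and request a decision
    // on the main thread once a full chunk is available.
    using Unity.MLAgents;
    using UnityEngine;

    public class AudioDecisionTrigger : MonoBehaviour
    {
        const int ChunkSize = 24000;   // samples per observation chunk

        readonly float[] _chunk = new float[ChunkSize];
        int _count;
        volatile bool _chunkReady;
        Agent _agent;

        void Awake()
        {
            _agent = GetComponent<Agent>();
        }

        // Runs on the audio thread; no Unity API calls allowed here.
        void OnAudioFilterRead(float[] data, int channels)
        {
            for (int i = 0; i < data.Length && _count < ChunkSize; i += channels)
                _chunk[_count++] = data[i];   // crude mono mix-down: first channel only

            if (_count == ChunkSize)
                _chunkReady = true;
        }

        void Update()
        {
            if (!_chunkReady) return;
            // The sensor would copy _chunk into its observation buffer here.
            _agent.RequestDecision();
            _count = 0;
            _chunkReady = false;
        }
    }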
     
  4. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    OK, the good news is this actually works. I was able to build a simple classifier / speech recognition system using "visual" observations. My agent can pretty reliably recognize spoken numbers, even if it hasn't heard a particular voice during training.

    The downsides are the issues already listed above, mainly the real-time training constraint.

    You can find a detailed description of the sensor and two example environments here:
    https://github.com/mbaske/ml-audio-sensor/

    I think, given the limitations, the sensor is basically a proof of concept at this point. If the team considers adding something like this in the future, please let me know if I should open a PR.

     
  5. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Here's another fun use case: Chord detection for music!
    https://github.com/mbaske/ml-chord-detection

    The agent is trained to identify 11 different chord types and the correct key. Types, keys, chord inversions and transpositions are randomized. Training audio is generated on the fly with a couple of sample instruments, which is supposed to generalize the policy for various types of sounds. A bass note in the chord's key is added 50% of the time, as are random drum loops. The visualization at inference shows the unfiltered cumulative agent guesses as key colors. Generally, detection seems to work best for clean, unambiguous signals. The more complex the audio, the harder it gets for the agent to land on a particular chord.
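    In case anyone wants to replicate the randomization, it's conceptually something like this (simplified sketch; the names and values here are illustrative and don't match the actual repo code):

    Code (CSharp):
    // Simplified sketch of the training-time randomization described above.
    using UnityEngine;

    public static class ChordRandomizer
    {
        // Intervals in semitones above the root; the repo covers 11 chord types.
        static readonly int[][] ChordTypes =
        {
            new[] { 0, 4, 7 },       // major
            new[] { 0, 3, 7 },       // minor
            new[] { 0, 4, 7, 10 },   // dominant 7th
            // ...remaining types
        };

        public static int[] NextChord(out int key, out bool addBass)
        {
            key = Random.Range(0, 12);                             // random key (root pitch class)
            var intervals = ChordTypes[Random.Range(0, ChordTypes.Length)];
            int inversion = Random.Range(0, intervals.Length);     // random inversion
            int transpose = Random.Range(-1, 2) * 12;              // random octave shift
            addBass = Random.value < 0.5f;                         // bass note 50% of the time

            var notes = new int[intervals.Length];
            for (int i = 0; i < intervals.Length; i++)
            {
                int interval = intervals[(i + inversion) % intervals.Length];
                if (i + inversion >= intervals.Length)
                    interval += 12;                                // raise wrapped notes by an octave
                notes[i] = key + interval + transpose;
            }
            // The caller would then pick a random sample instrument, optionally add
            // the bass note and a random drum loop, and synthesize the audio.
            return notes;
        }
    }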

     
    christophergoy likes this.
  6. doomquake

    doomquake

    Joined:
    Sep 2, 2019
    Posts:
    14
    This is awesome! Thank you for sharing, and I'm looking forward to more.
     
    mbaske likes this.
  7. andrewcoh_unity

    andrewcoh_unity

    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    This is really cool @mbaske

    I'm going to share this with the team.
     
    mbaske likes this.
  8. Ikta

    Ikta

    Joined:
    Sep 30, 2019
    Posts:
    5
    This is really cool!
     
    mbaske likes this.
  9. Romulofff

    Romulofff

    Joined:
    May 4, 2022
    Posts:
    1
    Hi everyone! This project is really amazing, and I'm trying to use it to evaluate some competitive agents. However, the limitation of a single AudioListener per scene is stopping me from advancing with my research (this is part of my Master's thesis). I need two different agents listening to each other in order to compete. Is there a way to do this? Also, has anyone found a workaround for the real-time training problem?

    Thanks and also congrats @mbaske for the project! Really helpful :D
     
  10. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Thanks @Romulofff, glad to hear it! Unfortunately, I'm not aware of any workaround for the real-time constraint. AFAIK, increasing the environment time-scale does not affect audio playback speed. Regarding multiple listeners, you could have different agents listening to the same signal, assuming there's one master channel everyone listens to and spatial info can be ignored.
    Since the sampled audio is converted to "visual" observations by the sensor, it might be possible to bypass the audio loop altogether? Are your agents generating / synthesizing audio to communicate? If that's the case, you'd already have the waveform data available. Maybe you could feed that to a matching visual observation directly? During inference, you could still generate corresponding audio output for demo purposes. You would need to look into how exactly the sensor encodes audio, if you were to take this approach. Let me know if I can clarify anything.
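    To illustrate the bypass idea, the data flow could be as simple as this (purely a sketch, not thread-safe, and not how the repo sensor actually encodes audio):

    Code (CSharp):
    // Sketch: the synthesizing agent writes its samples into a shared buffer,
    // and the listening agent's sensor reads chunks from that buffer instead
    // of going through the AudioListener.
    using System.Collections.Generic;

    public class SharedAudioBuffer
    {
        readonly Queue<float> _samples = new Queue<float>();
        readonly int _capacity;

        public SharedAudioBuffer(int capacity)
        {
            _capacity = capacity;
        }

        // Called by the agent that generates / synthesizes the waveform.
        public void Write(float[] samples)
        {
            foreach (var s in samples)
            {
                if (_samples.Count == _capacity)
                    _samples.Dequeue();   // drop the oldest sample
                _samples.Enqueue(s);
            }
        }

        // Called by the listening agent's sensor when it builds an observation.
        public float[] ReadChunk(int length)
        {
            var chunk = new float[length];
            int i = 0;
            foreach (var s in _samples)
            {
                if (i == length) break;
                chunk[i++] = s;
            }
            return chunk;
        }
    }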
     
    Last edited: May 4, 2022