Speech Recognition Engine / Recording Devices

Discussion in 'VR' started by mikewarren, May 22, 2019.

  1. mikewarren

    mikewarren

    Joined:
    Apr 21, 2014
    Posts:
    109
    I wasn't quite sure where to put this post as my target platform is an MS Hololens, but my questions are really about the Unity speech recognition API.

    I'd like to be able to use speech recognition in a Unity application on a Hololens with a microphone other than the microphone array built into the HL device. I've seen and been told conflicting information on whether it's possible so I'm looking for more details.

    The Unity speech API layer is fairly sparse. When a (Unity) recognition engine is instantiated, how is the audio input device chosen? On desktop systems, I assume it's the default recording device in the Sound panel. If so, is there a similar option on the Hololens, or an API to set the default recording device prior to creating a speech recognizer?

    I'm assuming the Unity speech API is a layer on top of the underlying Windows speech API. Is there any reason I couldn't just implement a Windows speech recognizer directly (for instance, problems resolving extra assemblies)? Which API is the Unity speech API built on?
     
    Last edited: May 22, 2019
  2. Tautvydas-Zilys

    Tautvydas-Zilys

    Unity Technologies

    Joined:
    Jul 25, 2013
    Posts:
    10,674
  3. mikewarren

    mikewarren

    Joined:
    Apr 21, 2014
    Posts:
    109
    Thanks @Tautvydas-Zilys. That helps.

    I know that on standard Windows 10 I've been able to change the audio input device via the Sound panel (default recording device). I don't see any such construct on the Hololens. Does anyone know if there's an API to enumerate and change the default audio device?
     
    Last edited: May 23, 2019
  4. Tautvydas-Zilys

    Tautvydas-Zilys

    Unity Technologies

    Joined:
    Jul 25, 2013
    Posts:
    10,674
    I unfortunately do not.
     
  5. timke

    timke

    Joined:
    Nov 30, 2017
    Posts:
    407
    From my understanding there's no API in Windows (desktop or otherwise) to change the default audio device; only the user is allowed to manipulate this setting. So the only way this could work is by hacking the Registry on the HL.

    I don't know if it's even possible to manipulate the Registry on HL, but changing the active audio capture device under this key: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\MMDevices\Audio\Capture should do the trick. Here's a post with a few details on this: https://superuser.com/questions/1054594/switching-default-audio-device-with-a-batch-file
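
    For reference, a minimal read-only sketch of what looking under that key could involve. It assumes desktop Windows with Python available (not a HoloLens), only lists the subkey names under the path quoted above, and does not change the default device. The function name `list_capture_endpoint_keys` is made up for illustration.

```python
import sys

# Key path quoted in the post above (relative to HKEY_LOCAL_MACHINE).
CAPTURE_KEY = r"SOFTWARE\Microsoft\Windows\CurrentVersion\MMDevices\Audio\Capture"


def list_capture_endpoint_keys():
    """Return the subkey names under the capture key, or [] if unavailable."""
    if sys.platform != "win32":
        return []  # the winreg module only exists on Windows

    import winreg

    names = []
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CAPTURE_KEY) as key:
            # QueryInfoKey returns (subkey count, value count, last modified).
            subkey_count = winreg.QueryInfoKey(key)[0]
            for index in range(subkey_count):
                names.append(winreg.EnumKey(key, index))
    except OSError:
        return []  # key missing or access denied
    return names


if __name__ == "__main__":
    for name in list_capture_endpoint_keys():
        print(name)
```

    This only reads the registry; writing to it is a separate and riskier step that the sketch deliberately leaves out.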

    However, even if you get past all this and get Speech Recognition to capture from an external microphone, it probably won't work very well (if at all). This is because the Speech system must be calibrated with the microphone to filter (cancel) out background noise, reverberation, etc. to produce a clean signal. Unlike Desktop Windows, HoloLens is designed to only work with the built-in microphone array, and (AFAIK) the calibration data cannot be changed so it'll use the same filtering on the audio stream from the external microphone. That is, it'll apply incorrect filtering producing a worse signal than if no filtering was used.
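
    A toy illustration of that last point, under heavy assumptions: the "filter" here is a plain subtraction of a known noise tone, and the frequencies and amplitudes are invented. It is not how the HoloLens audio stack actually works; it only shows how a filter calibrated for one microphone can make another microphone's signal worse than leaving it alone.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for the toy model)
N = SAMPLE_RATE     # one second of audio


def tone(freq_hz, amplitude):
    """Generate one second of a sine tone."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(N)]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


speech = tone(300, 1.0)          # stand-in for the clean voice signal
noise_builtin = tone(50, 0.3)    # noise seen by the built-in mic (assumed)
noise_external = tone(120, 0.3)  # different noise at the external mic (assumed)


def calibrated_filter(samples):
    """Cancel the noise profile the system was calibrated for (built-in mic)."""
    return [s - n for s, n in zip(samples, noise_builtin)]


builtin_capture = [s + n for s, n in zip(speech, noise_builtin)]
external_capture = [s + n for s, n in zip(speech, noise_external)]

err_matched = mean_squared_error(calibrated_filter(builtin_capture), speech)
err_unfiltered = mean_squared_error(external_capture, speech)
err_mismatched = mean_squared_error(calibrated_filter(external_capture), speech)

print("built-in mic, calibrated filter:", err_matched)
print("external mic, no filter:        ", err_unfiltered)
print("external mic, calibrated filter:", err_mismatched)
```

    In this model the mismatched case has about twice the error of the unfiltered one, because the filter subtracts a noise tone that is not present in the external capture and so adds it instead of removing it.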
     
  6. mikewarren

    mikewarren

    Joined:
    Apr 21, 2014
    Posts:
    109
    @timke Appreciate the feedback.

    I don't know of an API to change the default audio device either, but if it involves registry manipulation, I don't want any part of it anyway. I have an open dialog with MS and I'm trying to get an informed determination.

    I thought the audio processing (filtering, beam forming, etc.) was part of the device driver DSP, not the speech system, and that the audio sample data was pre-processed prior to hitting the recognition system. If so, I should (theoretically) be able to substitute any audio sample source, even recorded data.

    https://docs.microsoft.com/en-us/wi...ssing-modes#available-signal-processing-modes

    For instance, the Hololens produces a Communications, a Speech and an Other (environmental) stream from the built-in microphone array. The Communications and Speech streams use the beam forming / filtering technology you cite, whereas the Other stream does something different. (I've recorded samples from each stream in the same environment and it's startling how well the Speech stream filters noise.)

    https://docs.microsoft.com/en-us/windows/mixed-reality/voice-input#communication
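
    A small sketch of the substitution argument, assuming the recognizer only ever sees already-processed PCM samples. Everything here is hypothetical: `toy_recognizer` is an energy threshold standing in for a real recognizer, and the "live" source is a synthesized tone. The point is only that a consumer of plain PCM cannot tell a live stream from the same samples read back from a recording.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # Hz, assumed


def synth_pcm(n_samples):
    """Stand-in for pre-processed 16-bit mono PCM from a capture stream."""
    return [int(12000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE))
            for i in range(n_samples)]


def record_to_wav(samples):
    """Write samples to an in-memory WAV file and return its bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 16-bit samples
        writer.setframerate(SAMPLE_RATE)
        writer.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buf.getvalue()


def read_wav(data):
    """Read 16-bit mono samples back from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as reader:
        raw = reader.readframes(reader.getnframes())
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))


def toy_recognizer(samples):
    """Hypothetical consumer: True if the mean energy crosses a threshold."""
    energy = sum(s * s for s in samples) / len(samples)
    return energy > 1_000_000


live = synth_pcm(SAMPLE_RATE)            # "live" stream
recorded = read_wav(record_to_wav(live))  # same audio, via a recording

print(toy_recognizer(live) == toy_recognizer(recorded))
```

    Whether the real speech system accepts an arbitrary sample source is exactly the open question in this thread; the sketch does not answer it.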
     
  7. shaho1763

    shaho1763

    Joined:
    Mar 23, 2020
    Posts:
    1
    Hi,
    I want to work on virtual reality. I want to talk to Unity about converting speech to text and working on conversations. Please guide me. Thanks.
    my email : shaho1763@gmail.com