How to do gesture recognition?

Discussion in 'AR/VR (XR) Discussion' started by Follet, Feb 19, 2020.

  1. Follet

    Follet

    Joined:
    May 18, 2018
    Posts:
    38
    Hello,

    I'm developing a game where I want to do hand gestures; when moving the hand in certain patterns and rotations, it does certain things. Like casting spells in The Wizards VR

    Example:


    I already came up with some ways to do it, like creating colliders and checking the order in which those are triggered and with which hand rotation, or recording manually and then comparing the positions/rotations of the hands at set intervals.

    But both of these methods (and others I didn't explain) are expensive, and since I'm aiming at the Quest I need the best possible performance.
    Are there any better ways to do it? Does XR already provide something for this? I looked around the internet a lot and found nothing.

    Thanks for your time.
     
  2. JoeStrout

    JoeStrout

    Joined:
    Jan 14, 2011
    Posts:
    9,859
    What you're asking for is actually gesture recognition, or possibly stroke recognition (giving you some terms to search for). And an easy and effective way to do it is as follows:

    If the user needs to indicate when the gesture begins and ends, for example by holding the trigger, then simply record the position of the hand over time while the trigger (or whatever) is down. Then find the plane that best fits those points (this is a simple linear regression problem), and project all the points onto that plane. Then draw a bounding box around the points and square it off (i.e., make both width and height equal to whichever is greater). Finally, divide that box into a (say) 3x3 grid, assign each grid cell a character (you could use 1-9 like on a numeric keypad), and build the string of characters indicating the cells the gesture passed through.
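    As a rough illustration, the encoding step might look something like this in C# (the names, the 3x3 grid size, and passing in a plane normal plus an up direction are just choices for this sketch; the plane fit itself isn't shown here):

    ```csharp
    // Sketch of the encoding step: project recorded hand positions onto a plane,
    // square off the bounding box, and read off the 3x3 grid cells the stroke visits.
    using System.Collections.Generic;
    using System.Text;
    using UnityEngine;

    public static class StrokeEncoder
    {
        // Converts recorded hand positions into a string of grid-cell labels ('1'..'9').
        public static string Encode(List<Vector3> points, Vector3 planeNormal, Vector3 planeUp)
        {
            // Build an orthonormal basis on the plane so we can work in 2D.
            Vector3 n = planeNormal.normalized;
            Vector3 up = Vector3.ProjectOnPlane(planeUp, n).normalized;
            Vector3 right = Vector3.Cross(n, up);

            // Project every point onto the plane basis.
            var pts2D = new List<Vector2>(points.Count);
            foreach (Vector3 p in points)
                pts2D.Add(new Vector2(Vector3.Dot(p, right), Vector3.Dot(p, up)));

            // Bounding box, squared off so both sides equal the larger one.
            Vector2 min = pts2D[0], max = pts2D[0];
            foreach (Vector2 p in pts2D) { min = Vector2.Min(min, p); max = Vector2.Max(max, p); }
            float size = Mathf.Max(max.x - min.x, max.y - min.y, 0.0001f);
            Vector2 center = (min + max) * 0.5f;
            min = center - Vector2.one * (size * 0.5f);

            // Walk the stroke and record each 3x3 cell it enters (numeric-keypad labels).
            var sb = new StringBuilder();
            char last = '\0';
            foreach (Vector2 p in pts2D)
            {
                int col = Mathf.Clamp((int)((p.x - min.x) / size * 3f), 0, 2);
                int row = Mathf.Clamp((int)((p.y - min.y) / size * 3f), 0, 2); // row 0 = bottom
                char cell = (char)('1' + row * 3 + col); // bottom-left = '1', top-right = '9'
                if (cell != last) { sb.Append(cell); last = cell; }
            }
            return sb.ToString();
        }
    }
    ```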

    So for example, a vertical stroke downward (if you're using numeric-keypad labeling) would come through as 852. A horizontal stroke to the left would be 654. A diagonal stroke would have several variations like 78563, 74523, etc. Just record these strings as you're doing the stroke, and store them in a lookup table. To make your algorithm even more robust, if an input doesn't match any of the stored strings, calculate which stored string is closest (by Levenshtein distance) and treat the input the same as that.
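    A minimal sketch of that lookup step, assuming the known gestures are stored as encoded strings in a dictionary (the dictionary layout and the cutoff distance are just illustrative):

    ```csharp
    // Sketch of matching an encoded stroke against stored gesture strings
    // using Levenshtein (edit) distance.
    using System;
    using System.Collections.Generic;

    public static class GestureLookup
    {
        // Standard dynamic-programming Levenshtein distance between two strings.
        public static int Levenshtein(string a, string b)
        {
            var d = new int[a.Length + 1, b.Length + 1];
            for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
            for (int j = 0; j <= b.Length; j++) d[0, j] = j;
            for (int i = 1; i <= a.Length; i++)
                for (int j = 1; j <= b.Length; j++)
                {
                    int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                    d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                       d[i - 1, j - 1] + cost);
                }
            return d[a.Length, b.Length];
        }

        // Returns the name of the closest known gesture, or null if nothing is close enough.
        public static string Closest(string stroke, Dictionary<string, string> known, int maxDistance = 2)
        {
            string best = null;
            int bestDist = int.MaxValue;
            foreach (var kv in known) // key = encoded string (e.g. "852"), value = gesture name
            {
                int dist = Levenshtein(stroke, kv.Key);
                if (dist < bestDist) { bestDist = dist; best = kv.Value; }
            }
            return bestDist <= maxDistance ? best : null;
        }
    }
    ```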

    I've done this for mouse gestures, and it worked really well. I've never done it for VR, but I think the only added wrinkle there is that initial step: projecting the points onto a plane. Once you work that out, the rest should be straightforward.
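    For what it's worth, here is one possible way to handle that plane step, assuming the gesture is drawn roughly facing the user so that camera-local depth is the axis with the least variation; it does an ordinary least-squares fit of z = a*x + b*y + c in camera space (only a sketch, and it falls back to the camera's forward direction in the degenerate case):

    ```csharp
    // Sketch of a simple least-squares plane fit in camera-local space.
    using System.Collections.Generic;
    using UnityEngine;

    public static class PlaneFit
    {
        // Returns an approximate world-space normal of the plane best fitting the points.
        public static Vector3 FitNormal(List<Vector3> worldPoints, Transform cameraTransform)
        {
            float sxx = 0, sxy = 0, syy = 0, sx = 0, sy = 0, sxz = 0, syz = 0, sz = 0;
            int n = worldPoints.Count;
            foreach (Vector3 wp in worldPoints)
            {
                Vector3 p = cameraTransform.InverseTransformPoint(wp);
                sxx += p.x * p.x; sxy += p.x * p.y; syy += p.y * p.y;
                sx  += p.x;       sy  += p.y;       sz  += p.z;
                sxz += p.x * p.z; syz += p.y * p.z;
            }

            // Solve the 3x3 normal equations for (a, b) with Cramer's rule.
            float det = sxx * (syy * n - sy * sy) - sxy * (sxy * n - sy * sx) + sx * (sxy * sy - syy * sx);
            if (Mathf.Abs(det) < 1e-6f) return cameraTransform.forward; // degenerate: fall back to camera plane

            float a = (sxz * (syy * n - sy * sy) - sxy * (syz * n - sy * sz) + sx * (syz * sy - syy * sz)) / det;
            float b = (sxx * (syz * n - sy * sz) - sxz * (sxy * n - sx * sy) + sx * (sxy * sz - syz * sx)) / det;

            // The normal of z = a*x + b*y + c is (a, b, -1) in camera space.
            Vector3 localNormal = new Vector3(a, b, -1f).normalized;
            return cameraTransform.TransformDirection(localNormal);
        }
    }
    ```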
     
  3. Matt_D_work

    Matt_D_work

    Unity Technologies

    Joined:
    Nov 30, 2016
    Posts:
    202
    Gesture recognition is hard, especially robust gestures. We do provide ways to hook into platform-provided gestures (e.g. Magic Leap / HoloLens), but we do not provide a generic gesture recognition platform.

    I'm not sure if there's one on the Asset Store, but going with Joe's suggestion above would be a good start :) Or you could always start looking into machine learning models ;)
     
    Follet likes this.
  4. Follet

    Follet

    Joined:
    May 18, 2018
    Posts:
    38
    Thanks for the well-explained reply! I started with a 5x5 grid of colliders to test how the collider method works. I'll also try your method, since it looks quite nice and is no doubt more precise (but probably more expensive... or not, we'll see!). Btw, I didn't know about the Levenshtein distance, and it's definitely going to be useful! If not for this project, then for a future one :)

    Oh, I see. I didn't find any in the store, so in that case I'll try my best to create the best one I can, and if I end up with a robust enough way of recognizing gestures, I'll consider polishing it and uploading it to the Asset Store in the future. Let's hope the VR dev community can grow strong enough to make VR what it deserves to be.
     
    Last edited: Feb 19, 2020
  5. Habitablaba

    Habitablaba

    Joined:
    Aug 5, 2013
    Posts:
    136
    This is an interesting topic, and I think the Quest's hand tracking is causing a lot of people to start thinking along these lines.

    I’m for sure in that boat. But I’m more interested (right now) in detecting static hand poses. All the stuff I’ve looked up so far has basically been how to generate a skeleton from an image of a hand. But with the quest’s hand tracking, you get the skeleton for free. Where I fail is trying to figure out if that skeleton is in a specific shape - like “peace” or “live long and prosper”.

    On the surface, it sounds easy enough to just grab vectors between each of the joints, then use a series of dot products and distance calculations to determine what pose is happening... but in practice, this is arduous and fiddly.

    @JoeStrout does your advice translate to this application? I’m having a hard time wrapping my brain around this problem, and thus around the solutions as well.
     
  6. Follet

    Follet

    Joined:
    May 18, 2018
    Posts:
    38
    From what @JoeStrout said, you could use a Vector3 or float[3] array and use the Levenshtein distance to measure the difference between the expected pose and the user's pose. It should work; it's basically what you might already be doing, just via a different method. Check the wiki page about the Levenshtein distance if you're interested: https://en.wikipedia.org/wiki/Levenshtein_distance (it has some code and examples).


    I've been looking into the following and I'm not so sure any more that it's really useful, but in case it is, I'll leave it here: you can also look into the Central Limit Theorem.

    Good luck!
     
    Last edited: Feb 21, 2020
  7. JoeStrout

    JoeStrout

    Joined:
    Jan 14, 2011
    Posts:
    9,859
    For static pose recognition, I would probably choose some key distances and measure those. For example: between each fingertip and the neighboring fingertip, and between the fingertips and the palm of the hand.

    This gets you a collection of 10 or so numbers, i.e., a point in 10-dimensional space. Then you can pretty much just find the closest example in your lookup table of known poses using Euclidean distance (i.e. the sum of the squared differences in each measurement). Whichever one is closest is the pose you pick, with some cut-off distance beyond which you consider it to be no known pose.

    (If you want to sound fancy this is the "k-Nearest Neighbors" (kNN) algorithm with k=1.)
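    A minimal sketch of that nearest-neighbour lookup, assuming each pose has already been reduced to a fixed-length array of distances (the class name, the feature layout, and the cutoff value are just illustrative):

    ```csharp
    // Sketch of k-Nearest Neighbors with k=1 over fixed-length pose feature vectors.
    using System.Collections.Generic;

    public class PoseClassifier
    {
        // Known poses: name -> feature vector (all vectors must have the same length).
        private readonly Dictionary<string, float[]> known = new Dictionary<string, float[]>();
        private readonly float maxSquaredDistance;

        public PoseClassifier(float maxSquaredDistance = 0.01f)
        {
            this.maxSquaredDistance = maxSquaredDistance;
        }

        public void AddPose(string name, float[] features) => known[name] = features;

        // Returns the name of the closest known pose, or null if nothing is within the cutoff.
        public string Classify(float[] features)
        {
            string best = null;
            float bestDist = float.MaxValue;
            foreach (var kv in known)
            {
                float dist = 0f;
                for (int i = 0; i < features.Length; i++)
                {
                    float d = features[i] - kv.Value[i];
                    dist += d * d; // sum of squared differences (squared Euclidean distance)
                }
                if (dist < bestDist) { bestDist = dist; best = kv.Key; }
            }
            return bestDist <= maxSquaredDistance ? best : null;
        }
    }
    ```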
     
    Habitablaba likes this.
  8. Habitablaba

    Habitablaba

    Joined:
    Aug 5, 2013
    Posts:
    136
    Thanks for this idea. I lost myself in a rabbit hole of Wikipedia, so that's cool.
    I've implemented a version of this. Now I just need to generate a crap ton of sample data. Until that's done, whether this will work for me or not is really only theoretical.

    My biggest concern is that there are going to be a lot of poses that all calculate to the same ‘distance’ from multiple target poses. For example, if the user is pointing with their index finger and thumb, it seems to me that the distance would be very similar when compared to pointing with your index and pointing with your middle finger (or ring finger, for that matter).
    I guess I won’t really know for sure until I generate some test data, though.