ML-Agents Visual Observations best practices

Discussion in 'ML-Agents' started by nyonge, Jan 1, 2020.

  1. nyonge

    nyonge

    Joined:
    Jul 11, 2013
    Posts:
    43
    Are there any guides or rules of thumb for best practices in feeding an agent visual observations?

    Specifically, what should I feed it, and what limitations should I abide by? Ofc simple is better, but like... a 40x40 B&W demo of the environment? A 256x256 colour camera view? A 4K HD copy of the screen? What resolution and visual data limitations are good to aim for?

    The only example using them in the docs seems to be GridWorld, and I can’t find much in the docs on visual observation beyond how to use it - nothing for optimal usage. Lmk if I missed something.

    Thank you!
     
    nofreewill42 likes this.
  2. dracolytch

    dracolytch

    Joined:
    Jan 1, 2016
    Posts:
    15
    Visual observations are a beast to train on. I have a YouTube video on it somewhere... But basically speaking, it takes a long time to both learn to "see" and to learn the game. Start small, and start grayscale.
     
  3. nyonge

    nyonge

    Joined:
    Jul 11, 2013
    Posts:
    43
    Cheers. If you can dig it up, I'd love to take a look at that video.

    When you say "to both learn to see and learn the game", is there emphasis on both - like, will it mess with the agent to learn visually and numerically simultaneously? Beyond the inherent difficulties of learning on either method.
     
  4. dracolytch

    dracolytch

    Joined:
    Jan 1, 2016
    Posts:
    15
    Found it:
    https://www.youtube.com/watch?v=kiy086gRbeE

    If you pass in visual observations, the network has to make sense of those input pixels, and weight those pixels to come up with an appropriate action. There's a lot of noise, and not much signal. When you provide vector observations, you're essentially telling the system what's important, so it can skip right to weighting actions against the world state.

    There are absolutely problems where training on visual observations makes sense... but those are usually cases where you don't have access to the underlying game state (this is where cutting edge research on hundreds of computers is happening). In those cases where you have access to the game state, it's almost always preferable to use it.
     
    nofreewill42 and nyonge like this.
  5. nyonge

    nyonge

    Joined:
    Jul 11, 2013
    Posts:
    43
    Gotcha, so it's about improving the signal-to-noise ratio. That makes a lot of sense. Thanks for the YouTube link!
     
  6. caioc2

    caioc2

    Joined:
    May 11, 2018
    Posts:
    8
    As for the visual observation itself, stick with the 84x84 size, making sure the information you need can still be "seen" at that resolution. Also remember that you collect buffer_size images before training on them, and they need to fit in your RAM or GPU RAM. It is pretty easy to exhaust your GPU RAM, depending on the buffer_size and image size you are using.
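To get a feel for how quickly the buffer grows, here is a rough back-of-the-envelope sketch. It assumes each pixel channel is stored as a 32-bit float and ignores any framework overhead or frame stacking; the buffer_size of 10240 is only an illustrative value, not a recommendation from this thread.

```python
def observation_buffer_bytes(buffer_size, height, width, channels, bytes_per_value=4):
    """Estimate the raw size of a buffer of visual observations, in bytes.

    bytes_per_value=4 assumes float32 storage per pixel channel. That is an
    assumption for illustration; the real storage format may differ.
    """
    return buffer_size * height * width * channels * bytes_per_value


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024 ** 3


if __name__ == "__main__":
    # Illustrative buffer_size; use the value from your own trainer config.
    example_buffer_size = 10240
    shapes = {
        "84x84 grayscale": (84, 84, 1),
        "84x84 RGB": (84, 84, 3),
        "256x256 RGB": (256, 256, 3),
    }
    for label, (height, width, channels) in shapes.items():
        size = observation_buffer_bytes(example_buffer_size, height, width, channels)
        print(f"{label}: {to_gib(size):.2f} GiB")
```

The cost scales linearly with buffer_size and with the number of pixels, so going from 84x84 to 256x256 multiplies the footprint by roughly nine.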

    Choosing grayscale vs. RGB depends heavily on your use case. Color information can be very valuable, for example for an autonomous car on a road with yellow stripes and green vegetation along the margins. Don't underestimate it.
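A small sketch of what can get lost: if grayscale is produced by averaging the three channels (an assumption here; check how your own sensor converts), two visually distinct colors can collapse to the same gray value. The RGB values below are hypothetical, picked only to make the point.

```python
def to_gray_mean(rgb):
    """Convert an (R, G, B) tuple of 0-255 ints to gray by channel averaging.

    Channel averaging is one simple conversion, assumed for illustration.
    """
    r, g, b = rgb
    return (r + g + b) // 3


# Hypothetical colors for a road scene.
yellow_stripe = (255, 255, 0)
light_asphalt = (170, 170, 170)
green_verge = (0, 170, 0)

if __name__ == "__main__":
    for name, color in [
        ("yellow stripe", yellow_stripe),
        ("light asphalt", light_asphalt),
        ("green verge", green_verge),
    ]:
        print(f"{name}: RGB {color} -> gray {to_gray_mean(color)}")
```

With these values the yellow stripe and the asphalt end up with an identical gray level, so the agent could no longer tell them apart by brightness alone.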

    Whether to use vector observations or visual observations is very debatable. It may be easier for the neural network to build its own understanding of the world from visual observations than to learn human concepts and their encoding from vector observations.

    Three important things about an agent are: what it observes, what actions it can take, and, most important, how you reward good and bad actions. The rewards are the signals that teach it which actions connect with which observations. That matters a lot more than the kind of observation, as long as the observation contains the needed information. With bad feedback on its actions it will learn nothing, be it from vector, visual, or any other kind of observation.

    I had one case where visual observations were much better than vector observations (about 2x the performance). The downside is that training and inference with visual observations take a lot of time.

    In the end, the truth is that it all depends on your case and implementation; a single modification can take you from zero to hero and vice versa. Don't draw general conclusions from your first (or first few) tries. But use your tries to learn how to develop better agents, environments, and especially rewards.
     
    Dan_G, nyonge, mbaske and 1 other person like this.
  7. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    362
    Is it correct to view convolution as a form of data compression? In my understanding, convolutional layers extract higher level features from visual inputs. But that must come with a cost to precision, right? How about a situation like this: You have a strategy game with a variable number of relevant items, NPCs or other players spread out across some area. I imagine this would be hard to represent as vector observations. But you could generate a kind of minimap overview and use that as visual observation input. How hard would it be for an agent to detect subtle changes in the game state? Does something like this necessarily require a large model and millions of training steps?
     
  8. caioc2

    caioc2

    Joined:
    May 11, 2018
    Posts:
    8
    I wouldn't call the convolution and downsampling process a compression because it is not exactly reversible, but more like a (learned) mapping from the image to features.
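The "not exactly reversible" point can be illustrated with a toy example. This sketch uses fixed 2x2 average pooling as a stand-in for the downsampling step (a real network uses learned convolution filters, so this is only an analogy): two different inputs produce the same output, so the input cannot be recovered from the output.

```python
def avg_pool_2x2(image):
    """Downsample a 2D list by averaging each non-overlapping 2x2 block.

    Any trailing odd row or column is dropped.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Two clearly different 2x2 patches.
checkerboard = [[0, 4], [4, 0]]
flat = [[2, 2], [2, 2]]

if __name__ == "__main__":
    print(avg_pool_2x2(checkerboard))
    print(avg_pool_2x2(flat))
```

Both patches map to the same single value, which is why the process is better described as a many-to-one mapping to features than as compression.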

    About your example, I can't say for sure, but I don't think a minimap is a good fit. Let's assume it is a top-down view of your map: most parts of it will be constant or rarely change (even more so if it is downsampled to 84x84), and for learning we need variation in the observations.

    For the model size (take this with a good pinch of salt), I believe it must be somewhat proportional to the state space and the desired behavior. The confusing part is that it is not just the number of inputs (vectors or pixels) that matters, but the underlying information you want to learn. For simple tasks like an agent collecting boxes using visual observations, I used two layers with 512 units each and it worked nicely, but the best thing you can do is test various layouts.
    The same can be said of the training steps, which depend heavily on your reward signal. In that same example, with a bad reward it learned nothing after 40M steps, while it was doing OK after 500k steps with a good reward.
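For a rough sense of scale when comparing layouts, this sketch counts the trainable parameters in a stack of fully connected layers such as the two 512-unit layers mentioned above. The input size of 256 is a hypothetical placeholder for whatever the visual encoder outputs; it is not a figure from this thread, and the count excludes the convolutional encoder and the output heads.

```python
def dense_param_count(input_size, hidden_units):
    """Count weights and biases in a stack of fully connected layers.

    input_size: number of features entering the first layer (hypothetical
    here, since it depends on the encoder in front of these layers).
    hidden_units: list with the number of units in each layer.
    """
    total = 0
    previous = input_size
    for units in hidden_units:
        total += previous * units + units  # weight matrix plus bias vector
        previous = units
    return total


if __name__ == "__main__":
    # Compare a few layouts; only the 2 x 512 layout comes from the post above.
    for layout in ([128, 128], [512, 512], [512, 512, 512]):
        print(layout, dense_param_count(256, layout))
```

The count grows roughly with the square of the layer width, which is one reason to try smaller layouts first and widen only if the agent fails to learn.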
     
    Dan_G and mbaske like this.