Reward in self-play: how can an agent understand which actions it is for?

Discussion in 'ML-Agents' started by LexVolkov, Mar 28, 2020.

  1. LexVolkov

    LexVolkov

    Joined:
    Sep 14, 2014
    Posts:
    62
    Usually, the "Reward" tells the agent that the action it has just taken is correct.
    But if you follow the recommendations for self-play, a positive reward is given to the whole team at once after a certain condition is met. How will an individual agent understand which of its actions led to that victory?
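    For example (simplified, the names here are just placeholders), the team reward in my setup is given roughly like this:

        // Simplified sketch -- "teamAgents" and "TeamWon()" stand in for my actual objects.
        // Exact namespaces/method names depend on the ML-Agents version
        // (older releases use the MLAgents namespace and agent.Done() instead of EndEpisode()).
        using System.Collections.Generic;
        using UnityEngine;
        using Unity.MLAgents;

        public class TeamRewardController : MonoBehaviour
        {
            public List<Agent> teamAgents;        // every agent on one team

            void FixedUpdate()
            {
                if (TeamWon())                    // some game-specific win condition
                {
                    foreach (var agent in teamAgents)
                    {
                        agent.AddReward(1.0f);    // the same +1 for everyone, no matter what each agent actually did
                        agent.EndEpisode();
                    }
                }
            }

            bool TeamWon() { /* game-specific check */ return false; }
        }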
     
  2. christophergoy

    christophergoy

    Unity Technologies

    Joined:
    Sep 16, 2015
    Posts:
    735
    Hi @LexVolkov,
    From the self-play documentation:
    The reward signal should still be used as described in the documentation for the other trainers and reward signals. However, we encourage users to be a bit more conservative when shaping reward functions due to the instability and non-stationarity of learning in adversarial games.
    Since agents on the same "team" are executing the same policy, you can assume that rewards are working as they normally would for the current policy execution.
     
  3. andrewcoh_unity

    andrewcoh_unity

    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    Hi @LexVolkov ,

    What you're identifying is a very hard problem faced by multi-agent scenarios (and also just sparse reward reinforcement learning in general) known as credit assignment. Consider the scenario where two soccer agents are on the same team, and one scores while the other is doing something nonsensical that does not impact the game play. You are correct, the agent might learn to associate this nonsense with goal scoring/reward which would be incorrect. This makes it a hard learning problem, and to combat this, we train the agents for many timesteps and use many samples per update. The idea is that if we use many, many training samples, we can in some sense "average out" nonsense actions like this.

    TLDR; Team level rewards can create credit assignment issues for the individuals, so we use very large batches to "average out" these issues.
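
    For reference, the main knobs for this are batch_size and buffer_size in the trainer configuration YAML. A minimal sketch (the behavior name and the numbers are only illustrative, and the exact layout of the file depends on your ML-Agents version):

        MyTeamBehavior:          # placeholder for your Behavior Name
            trainer: ppo
            batch_size: 2048     # experiences per gradient-descent update
            buffer_size: 20480   # experiences collected before each update; a multiple of batch_size
            time_horizon: 1000   # steps per agent before a trajectory is handed to the trainer
            max_steps: 5.0e7     # total environment steps for the run

    Making buffer_size (and the total number of training steps) large is what gives the trainer enough samples to "average out" the nonsense actions described above.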

    Hope this helps. Let us know if you have any more questions.
     
    rufimelo99 likes this.
  4. LexVolkov

    LexVolkov

    Joined:
    Sep 14, 2014
    Posts:
    62
    Thank you, this is what I wanted.

    One more question, about the trainer configuration settings.
    There is an official description, but for an ordinary user like me, without a deep understanding of neural networks, it is difficult to tell which settings affect what in the agent. I would like more details, or examples.
     
    Hsgngr likes this.
  5. Hsgngr

    Hsgngr

    Joined:
    Dec 28, 2015
    Posts:
    61
    Hi @andrewcoh_unity, that was a really good answer, but I agree with @LexVolkov that documentation or links about the neural network configuration settings would be really helpful for beginners like us.
     
  6. andrzej_

    andrzej_

    Joined:
    Dec 2, 2016
    Posts:
    81
    I've just been talking with someone who has spent a ton of time experimenting with ML-Agents, and we both agreed that it would be incredibly useful if, even for the existing examples, Unity could do a hyperparameter search and provide detailed graphs for each run, for every parameter. It takes a ton of work on environments and agents to get any intuition about this, so a learning resource like that would be very helpful for many people, not only beginners.
     
    Hsgngr likes this.