
Question 2D array sensor

Discussion in 'ML-Agents' started by topitsky, Jan 27, 2023.

  1. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    I have a simple maze problem that I've gotten working somewhat with just vector observations. The agent gets a one-hot encoding of its surroundings from a 5x5 array containing the types of tiles (wall, end, etc.). It gets rewarded when it finds the end, plus 1f/MaxStep when it discovers a new tile, and this scheme seems to be working quite well. It can reach mean rewards upwards of 0.95, but the standard deviation is still poor, and it tends to get stuck in places where there are many walls. I also use discrete action masking: the agent can go in 4 directions if they're acceptable.

    Then I discovered that presenting info in this vector format, essentially 1D, can be confusing for the agent, so I started looking into the grid sensor. The grid sensor works with collisions, and my tilemap is just a class with numbers. Is there a way to "hack" the ISensor interface so that it treats the incoming data as visual? I've looked at https://github.com/mbaske/grid-sensor but couldn't figure out how to modify it for my purpose. I'm a noob in ML, so I have no idea how to encode a one-hot encoded array of things into a texture; if someone can help, I'd be grateful!
     
  2. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
  3. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    Oh nice, I didn't know that. So say I had e.g. floors and walls, so (0,0) or (0,1), then I would write two tiles like so: [0,0,0] = 1 and [0,1,1] = 1, meaning there was a floor present and a wall present?
     
  4. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Yeah, for example, right, assuming feature plane 0 is floor and feature plane 1 is wall.
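    As a minimal sketch of that idea (assuming this sits inside a Write() method with an ObservationWriter indexed as [height, width, channel]; the tiles array and its dimensions are hypothetical):
    Code (CSharp):
    // Sketch: one-hot feature planes, channel 0 = floor, channel 1 = wall.
    // Exactly one channel is set to 1 for each cell.
    for (var h = 0; h < height; h++)
    {
        for (var w = 0; w < width; w++)
        {
            bool isWall = tiles[h, w].wall;
            writer[h, w, 0] = isWall ? 0f : 1f; // floor plane
            writer[h, w, 1] = isWall ? 1f : 0f; // wall plane
        }
    }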
     
  5. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    This is a naive implementation, but is the idea correct?

    Code (CSharp):
    public int Write(ObservationWriter writer)
    {
        int numWritten = 0;

        for (var h = map.height - 1; h >= 0; h--)
        {
            for (var w = 0; w < map.width; w++)
            {
                var coord = new Vector2Int(h, w);
                var t = map.GetTile(coord);

                if (t.position == map.end)
                {
                    // Channel 0: end tile
                    writer[h, w, 0] = 1;
                    writer[h, w, 1] = 0;
                    writer[h, w, 2] = 0;
                    writer[h, w, 3] = 0;
                }
                else if (t.position == agent.currentPosition)
                {
                    // Channel 1: agent position
                    writer[h, w, 0] = 0;
                    writer[h, w, 1] = 1;
                    writer[h, w, 2] = 0;
                    writer[h, w, 3] = 0;
                }
                else if (t.wall)
                {
                    // Channel 2: wall
                    writer[h, w, 0] = 0;
                    writer[h, w, 1] = 0;
                    writer[h, w, 2] = 1;
                    writer[h, w, 3] = 0;
                }
                else
                {
                    // Channel 3: empty floor
                    writer[h, w, 0] = 0;
                    writer[h, w, 1] = 0;
                    writer[h, w, 2] = 0;
                    writer[h, w, 3] = 1;
                }

                numWritten += 4;
            }
        }

        return numWritten;
    }
     
  6. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Yes, I think so. Looks good, I think. But a reminder: the mlagents backend requires at least h >= 20, w >= 20, I think (though you could add padding, of course).
     
  7. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    (Adding padding would probably be a reasonable thing to do; it will make things run a bit more slowly, but you will get the prior on adjacency you are looking for. The net will learn to just ignore the padded bits; since they are zero, they won't do anything anyway. You will probably need to run on a GPU, otherwise each PPO learning phase might take ~5 minutes or so potentially.)
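    As a rough sketch of the padding idea (the 20x20 minimum, the top-left placement, and the WriteOneHotForTile helper are assumptions for illustration):
    Code (CSharp):
    // Sketch: declare a padded observation and write the real map into one corner,
    // leaving the padding cells at zero in every channel.
    const int minSize = 20; // assumed backend minimum, per the post above
    int paddedH = Mathf.Max(map.height, minSize);
    int paddedW = Mathf.Max(map.width, minSize);
    // GetObservationSpec() would then declare:
    // ObservationSpec.Visual(paddedH, paddedW, channelCount, ObservationType.Default);

    for (var h = 0; h < paddedH; h++)
    {
        for (var w = 0; w < paddedW; w++)
        {
            if (h >= map.height || w >= map.width)
            {
                for (int c = 0; c < channelCount; c++)
                    writer[h, w, c] = 0f; // padding: all channels zero
            }
            else
            {
                WriteOneHotForTile(writer, h, w); // hypothetical helper writing the tile's one-hot channels
            }
        }
    }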
     
  8. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    Yes, I'm testing now with just looking at the whole 20x20 map. The thing is, is the agent able to learn that it is indeed the player here, now that it is no longer the "center" of the observation?
     
  9. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Totally. Drawing the position of the agent in a separate feature plane is one of the best ways to show the AI where they are.
     
  10. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    Hmm, it doesn't seem to be learning anything after 100k steps; maybe I'm doing something wrong.
     
  11. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Is it exploring, and sometimes attaining the goal? It can only learn if it sometimes gets a reward.
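    A quick way to check that (a rough sketch; where you keep the counters is up to you, and reachedEnd is just a hypothetical flag set wherever you grant the terminal reward):
    Code (CSharp):
    // Sketch: log how often episodes actually end in success during exploration.
    if (reachedEnd)
    {
        successCount++;
    }
    episodeCount++;
    if (episodeCount % 100 == 0)
    {
        Debug.Log($"Successes so far: {successCount} / {episodeCount} episodes");
    }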
     
  12. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    Code (CSharp):
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;

    public class MapSensor : ISensor
    {
        public MouseAgent agent;
        public Map map;
        public Tile[] cache;
        public int channelCount;

        public byte[] GetCompressedObservation()
        {
            return null;
        }

        public CompressionSpec GetCompressionSpec()
        {
            return CompressionSpec.Default();
        }

        public string GetName()
        {
            return "MapSensor";
        }

        public ObservationSpec GetObservationSpec()
        {
            return ObservationSpec.Visual(map.height, map.width, channelCount, ObservationType.Default);
        }

        public void Reset()
        {
        }

        public void Update()
        {
        }

        // Writes a one-hot vector across all channels at (h, w).
        public void WriteOneHot(ObservationWriter writer, int h, int w, int channel, ref int numWritten)
        {
            for (int i = 0; i < channelCount; i++)
            {
                writer[h, w, i] = i == channel ? 1f : 0f;
                numWritten++;
            }
        }

        public int Write(ObservationWriter writer)
        {
            int numWritten = 0;

            for (var h = 0; h < map.height; h++)
            {
                for (var w = 0; w < map.width; w++)
                {
                    var coord = new Vector2Int(h, w);
                    var t = map.GetTile(coord);

                    if (t.position == map.end)
                    {
                        WriteOneHot(writer, h, w, 0, ref numWritten); // end tile
                        continue;
                    }

                    if (t.position == agent.currentPosition)
                    {
                        WriteOneHot(writer, h, w, 1, ref numWritten); // agent position
                        continue;
                    }

                    if (t.wall)
                    {
                        WriteOneHot(writer, h, w, 2, ref numWritten); // wall
                        continue;
                    }

                    if (agent.visited.ContainsKey(t))
                    {
                        WriteOneHot(writer, h, w, 3, ref numWritten); // visited floor
                        continue;
                    }

                    WriteOneHot(writer, h, w, 4, ref numWritten); // unvisited floor
                }
            }

            return numWritten;
        }
    }
    I'm able to get some reward with this, but I suspect it isn't actually observing anything, as it goes around in large circles trying to find the end. It feels like it's blind. Is there something I'm missing here? The Map class and the agent should be working properly, as I've tested them before with simple vector observations. I'm using a very sparse reward scheme:
    Code (CSharp):
    public override void OnActionReceived(ActionBuffers actions)
    {
        var t = map.GetTile(currentPosition);

        if (!visited.ContainsKey(t))
        {
            visited.Add(t, visitedStep);
        }

        var dist = Vector2Int.Distance(currentPosition, map.end);

        if (dist < 3)
        {
            success = true;
            AddReward(1f);
            EndEpisode();
            return;
        }

        var move = actions.DiscreteActions[0];
        map.Move(ref currentPosition, move);

        // Small per-step penalty to encourage reaching the end quickly.
        AddReward(-1f / MaxStep);
    }
    edit: tried looking at this: https://github.com/Unity-Technologi...ests/Runtime/Sensor/FloatVisualSensorTests.cs
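    One sanity check worth doing here (a rough sketch; it assumes ObservationSpec.Shape exposes the height, width and channel count declared above): confirm that channelCount covers all five channel indices used in Write(), and that the number of floats Write() reports matches the declared shape.
    Code (CSharp):
    // Sketch: the declared shape and the data written by Write() must agree.
    var spec = mapSensor.GetObservationSpec();
    int expected = spec.Shape[0] * spec.Shape[1] * spec.Shape[2]; // height * width * channels
    Debug.Log($"MapSensor expects {expected} floats per observation");
    // Write() above uses channel indices 0..4, so channelCount must be at least 5;
    // otherwise WriteOneHot() never sets the last category and that plane stays all zeros.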
     
  13. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    OK, so I actually ditched the multi-channel approach and instead went for multiple sensors, each with a single channel, and now it seems to be working and learning!

    Code (CSharp):
    public override ISensor[] CreateSensors()
    {
        var s1 = new MapSensor(agent, map, channels, "MapSensor.End", isEnd);
        var s2 = new MapSensor(agent, map, channels, "MapSensor.Player", isPlayer);
        var s3 = new MapSensor(agent, map, channels, "MapSensor.Wall", isWall);
        var s4 = new MapSensor(agent, map, channels, "MapSensor.Floor", isFloor);
        return new ISensor[] { s1, s2, s3, s4 };
    }

    public bool isEnd(Tile t)
    {
        return t.position == map.end;
    }

    public bool isPlayer(Tile t)
    {
        return t.position == agent.currentPosition;
    }

    public bool isWall(Tile t)
    {
        return t.wall;
    }

    public bool isFloor(Tile t)
    {
        return !t.wall;
    }
    Previously I was using just one sensor; it seems either Unity or Python doesn't support this many channels, and it's probably capped at 4 for RGBA images.
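    For reference, a rough sketch of what such a single-channel sensor can look like (the constructor below is simplified relative to the MapSensor(agent, map, channels, name, predicate) calls above, so treat the exact signature as an assumption):
    Code (CSharp):
    using System;
    using UnityEngine;
    using Unity.MLAgents.Sensors;

    // Sketch: a one-channel visual sensor driven by a Tile predicate.
    public class SingleChannelMapSensor : ISensor
    {
        readonly Map map;
        readonly string sensorName;
        readonly Func<Tile, bool> predicate;

        public SingleChannelMapSensor(Map map, string sensorName, Func<Tile, bool> predicate)
        {
            this.map = map;
            this.sensorName = sensorName;
            this.predicate = predicate;
        }

        public ObservationSpec GetObservationSpec()
            => ObservationSpec.Visual(map.height, map.width, 1, ObservationType.Default);

        public int Write(ObservationWriter writer)
        {
            int numWritten = 0;
            for (var h = 0; h < map.height; h++)
            {
                for (var w = 0; w < map.width; w++)
                {
                    var t = map.GetTile(new Vector2Int(h, w));
                    writer[h, w, 0] = predicate(t) ? 1f : 0f; // single binary feature plane
                    numWritten++;
                }
            }
            return numWritten;
        }

        public byte[] GetCompressedObservation() => null;
        public CompressionSpec GetCompressionSpec() => CompressionSpec.Default();
        public string GetName() => sensorName;
        public void Reset() { }
        public void Update() { }
    }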
     
  14. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Nice!

    > seems either Unity or Python doesn't support this many channels

    Nit: mlagents is an application written using C#, Unity, and Python. But mlagents is not itself equal to either 'Unity' or 'Python', and many limitations of mlagents are not fundamental limitations of either Unity or Python.

    (For example, I have an alternative approach to linking Unity and Python, and running RL, here.)
     
  15. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    Do you have any suggestions for batch size, buffer size, and time horizon for this kind of maze problem? I'm still kind of confused about those; I've been using batch size 32, buffer size 3200, time horizon 32, and episode max steps 256.

    Should the time horizon == max steps in this problem, to capture the whole trajectory in each episode? I'm confused about what "experience" means in this context: is one experience just the collection of the agent's observations + actions in one decision step or frame?
     
  16. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Your batch size and buffer size sound reasonable to me.

    Trade off with buffer size:
    - should be several times longer than the time between rewards, ideally, to keep gradients stable
    - if too long, then learning only takes place infrequently, so learning is slower

    I usually use the stable baselines 3 defaults of buffer size (n_steps) 2048 and batch size 64: https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
     
  17. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    And your max steps sounds plausible, though you want to make sure it's long enough that the agent can sometimes solve the task randomly. The trade-off is:
    - if too short, it might not have time to solve the game randomly
    - if too long and the agent gets stuck, it's just wasting time, since it won't learn from such episodes
     
  18. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
  19. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    I'm not sure about time horizon. The stable baselines 3 implementation (which is what I normally use) doesn't have something called time horizon. Some sources say it is equivalent to SB3's n_steps, but I believe n_steps maps to buffer_size, so I'm not sure. I guess I'd try the default 64 to start with, then try setting it to buffer_size and compare the results.
     
  20. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    I mean max steps as in the number of decisions before the episode is restarted; that is in the behavior parameters. I'm now getting better results with this than with the old config, have to continue experimenting:
    Code (YAML):
    behaviors:
      mouse:
        trainer_type: ppo
        hyperparameters:
          batch_size: 256
          buffer_size: 25600
          learning_rate: 0.0003
          beta: 0.005
          epsilon: 0.2
          lambd: 0.95
          num_epoch: 3
          learning_rate_schedule: constant
        network_settings:
          normalize: false
          hidden_units: 128
          num_layers: 2
          vis_encode_type: simple
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
        keep_checkpoints: 5
        max_steps: 5000000
        time_horizon: 256
        summary_freq: 25600
    engine_settings:
      time_scale: 100
     
  21. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Yeah, the new config file looks good. You can also play with num_layers and hidden_units: increasing num_layers, or bumping hidden_units up to 256 or 512, will make the neural network 'brain' bigger. It will run more slowly, but it can learn more complicated functions.
     
  22. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    I was able to get to around 0.95 mean reward with some tuning, but it still gets stuck in some places. Is this because I change the maze all the time? I might have overestimated the capabilities of this kind of learning; what I've read online is that most RL algorithms expect the environment to be stationary.
     
  23. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    The distribution of the environment should be stationary. This is not the same thing as saying that the environment should not change.

    What does 'distribution' mean? And how can a distribution be 'stationary'?

    Let's say you have a dice. You roll it a few times. You get 1, 1, 3, 1, 6, 5. But if you roll it enough times, about one sixth of the time you'll roll a 1, one sixth of the time you'll roll a 2, and so on. The distribution for this dice is the proportion of the time that you'll roll each number.

    The distribution of a dice is stationary: it doesn't change with time. Although each time you roll the dice you get a different number, the proportion of the rolls that is 1 doesn't change much. It's always about 1/6.

    Now, imagine a dice where on each roll you shave some wood off the sides of some of the faces, so that the 1 gradually becomes less likely (maybe you give the 1 a smaller face). Now the distribution is changing. It's no longer stationary.

    In RL, one way to get a non-stationary distribution is to have an RL agent play against another RL agent, e.g. play against itself.

    In your case, generating random environments means that you are drawing each environment from a certain distribution. But that distribution is fixed. Stationary.

    That doesn't mean your RL agent will learn perfectly. To learn perfectly, you have to train it for ages, and it needs a sufficiently large network. But there's no theoretical reason why an RL agent cannot learn against a stable distribution of randomly generated environments.
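    A concrete way to picture this for your maze (a sketch; GenerateRandomMaze and its parameters are just hypothetical stand-ins for however you build the maze):
    Code (CSharp):
    // Sketch: the maze changes every episode, but it is always drawn from the
    // same fixed generator, so the distribution of environments is stationary.
    public override void OnEpisodeBegin()
    {
        map = GenerateRandomMaze(width: 20, height: 20, wallDensity: 0.3f);
    }
    // A non-stationary setup, by contrast, would change the generator over time,
    // e.g. gradually increasing wallDensity as training progresses (a curriculum).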
     
  24. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    ah, so stationary means that things can be different, but they must be like identically different or different in the same way?
     
  25. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Yeah, exactly. (I mean, 'identically different' is not a technical definition, but I know what you mean :) )

    Note that changing the distribution doesn't necessarily break the agent. For example, a curriculum, where we start by giving easy tasks and then gradually move to harder tasks, is a change of distribution over time.

    Examples of where we don't change the distribution:
    - agent plays only Pacman (distribution is: Pacman game)
    - agent plays one of 50 different Atari games (distribution is: all 50 Atari games; even though there are 50 different games, we are drawing the choice of game from the same distribution each time)

    Example of where we change the distribution
    - train the agent on Pacman, then test the agent on Pong (initial distribution is Pacman game. Later distribution is Pong game)

    If an agent plays itself, then its opponent gets better over time and changes its behavior. This means that things that used to work to win no longer work, which can be quite unstable. Think of it like this: you play Starcraft, and you zerg rush your opponent, and you win. The next time you play, your opponent might expect you to zerg rush, and if you zerg rush again, this time you die.
     
  26. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    I've been going on a whole ML journey now and ran across AlphaZero and the elegant hack they do with MCTS, and it has gotten me thinking about how to do a similar thing in mlagents. Say, in this maze problem, I would run MCTS to hopefully find the end, then compare the action that MCTS gave vs. the neural network's, and reward the agent according to how close it came. Would the agent then learn a kind of "intuitive" MCTS when it sees new inputs?
     
  27. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    You don't need MCTS to get the agent to learn intuition. It's sufficient to train the agent based on observations, using e.g. PPO.

    MCTS is used for Go because you don't know what the opponent is going to do. And what they are going to do depends on what you do. And what they do next depends on what they think you will do after that. And what they think you will do after that depends on what they think you will do after their next move after your next move...

    That's the 'tree search' bit.

    The Monte Carlo bit is because there's no obvious way to 'score' positions in Go. For chess, you can count the number of pieces and give each piece a value: pawn is 1, queen is 9, something like that. You can only search forward 4-5 moves or so in chess, otherwise it just takes too long. So you search 4-5 moves ahead, a huge tree of possible moves, and choose the move that gives the highest value (most pieces for you, fewest pieces for the opponent).

    So, for Go, what they do to score a board configuration is simply run a bunch of semi-random games and see who wins. Do this enough times (e.g. 200 or something), and this gives some kind of measure of the 'score' of a position.

    So, MCTS is like:
    - build a tree, for a few moves deep
    - for the leaves in the tree, just play randomly to the end, a few hundred times, and score that position according to the number of times you won

    Then, for AlphaGo, per my understanding, a convolutional neural network is thrown into the mix, to guide which parts of the tree to build, though my understanding gets a little hazy at this point, so this last sentence might not be quite right.

    In your case, the maze is a stationary distribution, so no need to run MCTS I think?
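    For reference, the random-playout scoring step looks roughly like this (a sketch; GameState, Clone, PlayRandomMove, IsFinished and WinnerIs are hypothetical stand-ins for whatever the game implementation provides, not AlphaGo's actual code):
    Code (CSharp):
    // Sketch: Monte Carlo scoring of a leaf position by random playouts.
    float ScoreByRandomPlayouts(GameState leaf, int playouts, Player me)
    {
        int wins = 0;
        for (int i = 0; i < playouts; i++)
        {
            var state = leaf.Clone();
            while (!state.IsFinished)
                state.PlayRandomMove(); // semi-random policy to the end of the game
            if (state.WinnerIs(me))
                wins++;
        }
        // Fraction of playouts won from this position ~ value of the position.
        return (float)wins / playouts;
    }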
     
  28. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
  29. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    I'll have to test whether MCTS could help in these kinds of problems. In case you're interested, these were very good write-ups on the inner workings of AlphaZero: https://joshvarty.github.io/AlphaZero/ and https://web.stanford.edu/~surag/posts/alphazero.html
     