
Question 2D array sensor

Discussion in 'ML-Agents' started by topitsky, Jan 27, 2023.

  1. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    I have a simple maze problem that I've gotten working somewhat with just vector observations. The agent gets a one-hot encoding of its surroundings from a 5x5 array containing the types of tiles (wall, end, etc.). It gets rewarded when it finds the end, plus 1f/MaxStep when it discovers a new tile, and this scheme seems to be working quite well. It can reach mean rewards upwards of 0.95, but the standard deviation is still poor, and it tends to get stuck in places where there are many walls. I also use discrete action masking: the agent can go in 4 directions if they're acceptable.

    Then I discovered that presenting info in this vector format, essentially 1D, can be confusing for the agent, so I started looking into the grid sensor. The grid sensor works with collisions, and my tilemap is just a class with numbers. Is there a way to "hack" the ISensor interface so that it treats the incoming data as visual? I've looked at https://github.com/mbaske/grid-sensor but couldn't figure out how to modify it for my purpose. I'm a noob in ML, so I have no idea how to encode a one-hot encoded array of things into a texture; if someone can help, I'd be grateful!
     
  2. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
  3. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    Oh nice, I didn't know that. So say I had e.g. floors and walls, so (0,0) or (0,1), then I would write two tiles like so: [0,0,0] = 1 and [0,1,1] = 1, meaning there was a floor present and a wall present?
     
  4. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Yeah, for example, right, assuming feature plane 0 is floor and feature plane 1 is wall.
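    As a minimal sketch of that idea (assuming this sits inside a Write() method with an ObservationWriter indexed as [height, width, channel]; the tiles array and its dimensions are hypothetical):
    Code (CSharp):
    // Sketch: one-hot feature planes, channel 0 = floor, channel 1 = wall.
    // Exactly one channel is set to 1 for each cell.
    for (var h = 0; h < height; h++)
    {
        for (var w = 0; w < width; w++)
        {
            bool isWall = tiles[h, w].wall;
            writer[h, w, 0] = isWall ? 0f : 1f; // floor plane
            writer[h, w, 1] = isWall ? 1f : 0f; // wall plane
        }
    }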
     
  5. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    This is a naive implementation, but is the idea correct?

    Code (CSharp):
    public int Write(ObservationWriter writer)
    {
        int numWritten = 0;

        for (var h = map.height - 1; h >= 0; h--)
        {
            for (var w = 0; w < map.width; w++)
            {
                var coord = new Vector2Int(h, w);
                var t = map.GetTile(coord);

                if (t.position == map.end)
                {
                    // Channel 0: end tile
                    writer[h, w, 0] = 1;
                    writer[h, w, 1] = 0;
                    writer[h, w, 2] = 0;
                    writer[h, w, 3] = 0;
                }
                else if (t.position == agent.currentPosition)
                {
                    // Channel 1: agent position
                    writer[h, w, 0] = 0;
                    writer[h, w, 1] = 1;
                    writer[h, w, 2] = 0;
                    writer[h, w, 3] = 0;
                }
                else if (t.wall)
                {
                    // Channel 2: wall
                    writer[h, w, 0] = 0;
                    writer[h, w, 1] = 0;
                    writer[h, w, 2] = 1;
                    writer[h, w, 3] = 0;
                }
                else
                {
                    // Channel 3: empty floor
                    writer[h, w, 0] = 0;
                    writer[h, w, 1] = 0;
                    writer[h, w, 2] = 0;
                    writer[h, w, 3] = 1;
                }

                numWritten += 4;
            }
        }

        return numWritten;
    }
     
  6. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Yes, I think so. Looks good, I think. But a reminder: the mlagents backend requires at least h >= 20, w >= 20, I think (though you could add padding, of course).
     
  7. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    (Adding padding would probably be a reasonable thing to do; it will make things run a bit more slowly, but you will get the prior on adjacency you are looking for. The net will learn to just ignore the padded bits; since they are zero, they won't do anything anyway. You will probably need to run on a GPU, otherwise each PPO learning phase might take ~5 minutes or so potentially.)
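    As a rough sketch of the padding idea (the 20x20 minimum, the top-left placement, and the WriteOneHotForTile helper are assumptions for illustration):
    Code (CSharp):
    // Sketch: declare a padded observation and write the real map into one corner,
    // leaving the padding cells at zero in every channel.
    const int minSize = 20; // assumed backend minimum, per the post above
    int paddedH = Mathf.Max(map.height, minSize);
    int paddedW = Mathf.Max(map.width, minSize);
    // GetObservationSpec() would then declare:
    // ObservationSpec.Visual(paddedH, paddedW, channelCount, ObservationType.Default);

    for (var h = 0; h < paddedH; h++)
    {
        for (var w = 0; w < paddedW; w++)
        {
            if (h >= map.height || w >= map.width)
            {
                for (int c = 0; c < channelCount; c++)
                    writer[h, w, c] = 0f; // padding: all channels zero
            }
            else
            {
                WriteOneHotForTile(writer, h, w); // hypothetical helper writing the tile's one-hot channels
            }
        }
    }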
     
  8. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    Yes, I'm testing now with just looking at the whole 20x20 map. The thing is, is the agent able to learn that it is indeed the player here, now that it is no longer the "center" of the observation?
     
  9. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Totally. Drawing the position of the agent in a separate feature plane is one of the best ways to show the AI where they are.
     
  10. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    Hmm, it doesn't seem to be learning anything after 100k steps; maybe I'm doing something wrong.
     
  11. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Is it exploring, and sometimes attaining the goal? It can only learn if it sometimes gets a reward.
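    A quick way to check that (a rough sketch; where you keep the counters is up to you, and reachedEnd is just a hypothetical flag set wherever you grant the terminal reward):
    Code (CSharp):
    // Sketch: log how often episodes actually end in success during exploration.
    if (reachedEnd)
    {
        successCount++;
    }
    episodeCount++;
    if (episodeCount % 100 == 0)
    {
        Debug.Log($"Successes so far: {successCount} / {episodeCount} episodes");
    }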
     
  12. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    Code (CSharp):
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;

    public class MapSensor : ISensor
    {
        public MouseAgent agent;
        public Map map;
        public Tile[] cache;
        public int channelCount;

        public byte[] GetCompressedObservation()
        {
            return null;
        }

        public CompressionSpec GetCompressionSpec()
        {
            return CompressionSpec.Default();
        }

        public string GetName()
        {
            return "MapSensor";
        }

        public ObservationSpec GetObservationSpec()
        {
            return ObservationSpec.Visual(map.height, map.width, channelCount, ObservationType.Default);
        }

        public void Reset()
        {
        }

        public void Update()
        {
        }

        // Writes a one-hot vector across all channels at (h, w).
        public void WriteOneHot(ObservationWriter writer, int h, int w, int channel, ref int numWritten)
        {
            for (int i = 0; i < channelCount; i++)
            {
                writer[h, w, i] = i == channel ? 1f : 0f;
                numWritten++;
            }
        }

        public int Write(ObservationWriter writer)
        {
            int numWritten = 0;

            for (var h = 0; h < map.height; h++)
            {
                for (var w = 0; w < map.width; w++)
                {
                    var coord = new Vector2Int(h, w);
                    var t = map.GetTile(coord);

                    if (t.position == map.end)
                    {
                        WriteOneHot(writer, h, w, 0, ref numWritten); // end tile
                        continue;
                    }

                    if (t.position == agent.currentPosition)
                    {
                        WriteOneHot(writer, h, w, 1, ref numWritten); // agent position
                        continue;
                    }

                    if (t.wall)
                    {
                        WriteOneHot(writer, h, w, 2, ref numWritten); // wall
                        continue;
                    }

                    if (agent.visited.ContainsKey(t))
                    {
                        WriteOneHot(writer, h, w, 3, ref numWritten); // visited floor
                        continue;
                    }

                    WriteOneHot(writer, h, w, 4, ref numWritten); // unvisited floor
                }
            }

            return numWritten;
        }
    }
    I'm able to get some reward with this, but I suspect it isn't actually observing anything, as it goes around in large circles trying to find the end. It feels like it's blind. Is there something I'm missing here? The Map class and the agent should be working properly, as I've tested them before with simple vector observations. I'm using a very sparse reward scheme:
    Code (CSharp):
    public override void OnActionReceived(ActionBuffers actions)
    {
        var t = map.GetTile(currentPosition);

        if (!visited.ContainsKey(t))
        {
            visited.Add(t, visitedStep);
        }

        var dist = Vector2Int.Distance(currentPosition, map.end);

        if (dist < 3)
        {
            success = true;
            AddReward(1f);
            EndEpisode();
            return;
        }

        var move = actions.DiscreteActions[0];
        map.Move(ref currentPosition, move);

        // Small per-step penalty to encourage reaching the end quickly.
        AddReward(-1f / MaxStep);
    }
    edit: tried looking at this: https://github.com/Unity-Technologi...ests/Runtime/Sensor/FloatVisualSensorTests.cs
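    One sanity check worth doing here (a rough sketch; it assumes ObservationSpec.Shape exposes the height, width and channel count declared above): confirm that channelCount covers all five channel indices used in Write(), and that the number of floats Write() reports matches the declared shape.
    Code (CSharp):
    // Sketch: the declared shape and the data written by Write() must agree.
    var spec = mapSensor.GetObservationSpec();
    int expected = spec.Shape[0] * spec.Shape[1] * spec.Shape[2]; // height * width * channels
    Debug.Log($"MapSensor expects {expected} floats per observation");
    // Write() above uses channel indices 0..4, so channelCount must be at least 5;
    // otherwise WriteOneHot() never sets the last category and that plane stays all zeros.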
     
  13. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    OK, so I actually ditched the multi-channel approach and instead went for multiple sensors, each with a single channel, and now it seems to be working and learning!

    Code (CSharp):
    public override ISensor[] CreateSensors()
    {
        var s1 = new MapSensor(agent, map, channels, "MapSensor.End", isEnd);
        var s2 = new MapSensor(agent, map, channels, "MapSensor.Player", isPlayer);
        var s3 = new MapSensor(agent, map, channels, "MapSensor.Wall", isWall);
        var s4 = new MapSensor(agent, map, channels, "MapSensor.Floor", isFloor);
        return new ISensor[] { s1, s2, s3, s4 };
    }

    public bool isEnd(Tile t)
    {
        return t.position == map.end;
    }

    public bool isPlayer(Tile t)
    {
        return t.position == agent.currentPosition;
    }

    public bool isWall(Tile t)
    {
        return t.wall;
    }

    public bool isFloor(Tile t)
    {
        return !t.wall;
    }
    Previously I was using just one sensor; it seems either Unity or Python doesn't support this many channels, and it's probably capped at 4 for RGBA images.
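    For reference, a rough sketch of what such a single-channel sensor can look like (the constructor below is simplified relative to the MapSensor(agent, map, channels, name, predicate) calls above, so treat the exact signature as an assumption):
    Code (CSharp):
    using System;
    using UnityEngine;
    using Unity.MLAgents.Sensors;

    // Sketch: a one-channel visual sensor driven by a Tile predicate.
    public class SingleChannelMapSensor : ISensor
    {
        readonly Map map;
        readonly string sensorName;
        readonly Func<Tile, bool> predicate;

        public SingleChannelMapSensor(Map map, string sensorName, Func<Tile, bool> predicate)
        {
            this.map = map;
            this.sensorName = sensorName;
            this.predicate = predicate;
        }

        public ObservationSpec GetObservationSpec()
            => ObservationSpec.Visual(map.height, map.width, 1, ObservationType.Default);

        public int Write(ObservationWriter writer)
        {
            int numWritten = 0;
            for (var h = 0; h < map.height; h++)
            {
                for (var w = 0; w < map.width; w++)
                {
                    var t = map.GetTile(new Vector2Int(h, w));
                    writer[h, w, 0] = predicate(t) ? 1f : 0f; // single binary feature plane
                    numWritten++;
                }
            }
            return numWritten;
        }

        public byte[] GetCompressedObservation() => null;
        public CompressionSpec GetCompressionSpec() => CompressionSpec.Default();
        public string GetName() => sensorName;
        public void Reset() { }
        public void Update() { }
    }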
     
  14. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Nice!

    > seems either Unity or Python doesn't support this many channels

    Nit: mlagents is an application written using C#, Unity, and Python. But mlagents is not itself equal to either 'Unity' or 'Python', and many limitations of mlagents are not fundamental limitations of either Unity or Python.

    (For example, I have an alternative approach to linking Unity and Python, and running RL, here.)
     
  15. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    Do you have any suggestions for batch size, buffer size, and time horizon for this kind of maze problem? I'm still kind of confused about those; I've been using batch size 32, buffer size 3200, time horizon 32, and episode max steps 256.

    Should the time horizon == max steps in this problem, to capture the whole trajectory in each episode? I'm confused about what "experience" means in this context: is one experience just the collection of the agent's observations + actions in one decision step or frame?
     
  16. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Your batch size and buffer size sound reasonable to me.

    Trade off with buffer size:
    - should be several times longer than the time between rewards, ideally, to keep gradients stable
    - if too long, then learning only takes place infrequently, so learning is slower

    I usually use the stable baselines 3 defaults of buffer size (n_steps) 2048 and batch size 64: https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
     
  17. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    And your max steps sounds plausible, though you want to make sure it's long enough that the agent can sometimes solve the task randomly. The trade-off is:
    - if too short, it might not have time to solve the game randomly
    - if too long and the agent gets stuck, it's just wasting time, since it won't learn from such episodes
     
  18. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
  19. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    I'm not sure about time horizon. The stable baselines 3 implementation (which is what I normally use) doesn't have something called time horizon. Some sources say it is equivalent to SB3's n_steps, but I believe n_steps maps to buffer_size, so I'm not sure. I guess I'd try the default 64 to start with, then try setting it to buffer_size and compare the results.
     
  20. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    I mean max steps as in the number of decisions before the episode is restarted; that is in the behavior parameters. I'm now getting better results with this than with the old config, have to continue experimenting:
    Code (YAML):
    behaviors:
      mouse:
        trainer_type: ppo
        hyperparameters:
          batch_size: 256
          buffer_size: 25600
          learning_rate: 0.0003
          beta: 0.005
          epsilon: 0.2
          lambd: 0.95
          num_epoch: 3
          learning_rate_schedule: constant
        network_settings:
          normalize: false
          hidden_units: 128
          num_layers: 2
          vis_encode_type: simple
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
        keep_checkpoints: 5
        max_steps: 5000000
        time_horizon: 256
        summary_freq: 25600
    engine_settings:
      time_scale: 100
     
  21. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Yeah, the new config file looks good. You can also play with num_layers and hidden_units: increasing num_layers, or bumping hidden_units up to 256 or 512, will make the neural network 'brain' bigger. It will run more slowly, but it can learn more complicated functions.
     
  22. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    I was able to get to around 0.95 mean reward with some tuning, but it still gets stuck in some places. Is this because I change the maze all the time? I might have overestimated the capabilities of this kind of learning; what I've read online is that most RL algorithms expect the environment to be stationary.
     
  23. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    The distribution of the environment should be stationary. This is not the same thing as saying that the environment should not change.

    What does 'distribution' mean? And how can a distribution be 'stationary'?

    Let's say you have a dice. You roll it a few times. You get 1, 1, 3, 1, 6, 5. But if you roll it enough times, about one sixth of the time you'll roll a 1, one sixth of the time you'll roll a 2, and so on. The distribution for this dice is the proportion of the time that you'll roll each number.

    The distribution of a dice is stationary: it doesn't change with time. Although each time you roll the dice you get a different number, the proportion of the rolls that is 1 doesn't change much. It's always about 1/6.

    Now, imagine a dice where on each roll you shave some wood off the sides of some of the faces, so that the 1 gradually becomes less likely (maybe you give the 1 a smaller face). Now the distribution is changing. It's no longer stationary.

    In RL, one way to get a non-stationary distribution is to have an RL agent play against another RL agent, e.g. play against itself.

    In your case, generating random environments means that you are drawing each environment from a certain distribution. But that distribution is fixed. Stationary.

    That doesn't mean your RL agent will learn perfectly. To learn perfectly, you have to train it for ages, and it needs a sufficiently large network. But there's no theoretical reason why an RL agent cannot learn against a stable distribution of randomly generated environments.
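    A concrete way to picture this for your maze (a sketch; GenerateRandomMaze and its parameters are just hypothetical stand-ins for however you build the maze):
    Code (CSharp):
    // Sketch: the maze changes every episode, but it is always drawn from the
    // same fixed generator, so the distribution of environments is stationary.
    public override void OnEpisodeBegin()
    {
        map = GenerateRandomMaze(width: 20, height: 20, wallDensity: 0.3f);
    }
    // A non-stationary setup, by contrast, would change the generator over time,
    // e.g. gradually increasing wallDensity as training progresses (a curriculum).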
     
  24. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    ah, so stationary means that things can be different, but they must be like identically different or different in the same way?
     
  25. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Yeah, exactly. (I mean, 'identically different' is not a technical definition, but I know what you mean :) )

    Note that changing the distribution doesn't necessarily break the agent. For example, a curriculum, where we start by giving easy tasks and then gradually move to harder tasks, is a change of distribution over time.

    Examples of where we don't change the distribution:
    - agent plays only Pacman (distribution is: Pacman game)
    - agent plays one of 50 different Atari games (distribution is: all 50 Atari games; even though there are 50 different games, we are drawing the choice of game from the same distribution each time)

    Example of where we change the distribution
    - train the agent on Pacman, then test the agent on Pong (initial distribution is Pacman game. Later distribution is Pong game)

    If an agent plays itself, then its opponent gets better over time and changes its behavior. This means that things that used to work to win no longer work, which can be quite unstable. Think of it like this: you play Starcraft, and you zerg rush your opponent, and you win. The next time you play, your opponent might expect you to zerg rush, and if you zerg rush again, this time you die.
     
  26. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    I've been going on a whole ML journey now and ran across AlphaZero and the elegant hack they do with MCTS, and it has gotten me thinking about how to do a similar thing in mlagents. Say, in this maze problem, I would run MCTS to hopefully find the end, then compare the action that MCTS gave vs. the neural network's, and reward the agent according to how close it came. Would the agent then learn a kind of "intuitive" MCTS when it sees new inputs?
     
  27. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    You don't need MCTS to get the agent to learn intuition. It's sufficient to train the agent based on observations, using e.g. PPO.

    MCTS is used for Go because you don't know what the opponent is going to do. And what they are going to do depends on what you do. And what they do next depends on what they think you will do after that. And what they think you will do after that depends on what they think you will do after their next move after your next move...

    That's the 'tree search' bit.

    The Monte Carlo bit is because there's no obvious way to 'score' positions in Go. For chess, you can count the number of pieces and give each piece a value: pawn is 1, queen is 9, something like that. You can only search forward 4-5 moves or so in chess, otherwise it just takes too long. So you search 4-5 moves ahead, a huge tree of possible moves, and choose the move that gives the highest value (most pieces for you, fewest pieces for the opponent).

    So, for Go, what they do to score a board configuration is simply run a bunch of semi-random games and see who wins. Do this enough times (e.g. 200 or something), and this gives some kind of measure of the 'score' of a position.

    So, MCTS is like:
    - build a tree, for a few moves deep
    - for the leaves in the tree, just play randomly to the end, a few hundred times, and score that position according to the number of times you won

    Then, for AlphaGo, per my understanding, a convolutional neural network is thrown into the mix, to guide which parts of the tree to build, though my understanding gets a little hazy at this point, so this last sentence might not be quite right.

    In your case, the maze is a stationary distribution, so no need to run MCTS I think?
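    For reference, the random-playout scoring step looks roughly like this (a sketch; GameState, Clone, PlayRandomMove, IsFinished and WinnerIs are hypothetical stand-ins for whatever the game implementation provides, not AlphaGo's actual code):
    Code (CSharp):
    // Sketch: Monte Carlo scoring of a leaf position by random playouts.
    float ScoreByRandomPlayouts(GameState leaf, int playouts, Player me)
    {
        int wins = 0;
        for (int i = 0; i < playouts; i++)
        {
            var state = leaf.Clone();
            while (!state.IsFinished)
                state.PlayRandomMove(); // semi-random policy to the end of the game
            if (state.WinnerIs(me))
                wins++;
        }
        // Fraction of playouts won from this position ~ value of the position.
        return (float)wins / playouts;
    }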
     
  28. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
  29. topitsky

    topitsky

    Joined:
    Jan 12, 2016
    Posts:
    100
    I'll have to test whether MCTS could help in these kinds of problems. In case you're interested, these were very good write-ups on the inner workings of AlphaZero: https://joshvarty.github.io/AlphaZero/ and https://web.stanford.edu/~surag/posts/alphazero.html
     