Hi, I've seen a Python error message saying the minimum width/height for "vis_encode_type: simple" is 20. Is this an arbitrary number, or is there a computational reason for it? I'm writing a 64x4 band of depth texture values into a render texture, and the minimum size prevents me from using it as a sensor. It's not a problem to increase the texture size, but for the sake of low memory and processing usage, I'm trying to keep things as small and simple as possible. Would it be preferable to implement a 2D float sensor for my use case instead? Thanks!
Anything smaller than those values would result in a TensorFlow error (I'm fuzzy on the details, but it's something about the strides in the convolutional layers). FWIW, the minimum dimension for resnet is 15, so that might be enough for you. I'll log a feature request to support smaller dimensions too. I'm not sure that a 2D float sensor would be any better; it would still be subject to the same dimension restrictions.
The problem is that if you pool and convolve too many times on a small image, the feature map ends up with zero or even negative dimensions on the other end. There's also an issue with the divisibility of texture resolutions: only certain resolutions avoid padding being added to the edges of the image (not a big deal, though).
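To make the shrinkage concrete, here's a back-of-the-envelope sketch. I'm assuming the simple encoder is two unpadded conv layers, an 8x8 kernel with stride 4 followed by a 4x4 kernel with stride 2 (the classic DQN-style stack; I'd have to double-check the source to be sure those are the exact values):

```python
def conv_out(size, kernel, stride):
    # Standard output-size formula for an unpadded ("valid") convolution.
    return (size - kernel) // stride + 1

def simple_encoder_out(size):
    # Chain the two conv layers of the (assumed) simple encoder:
    # 8x8 kernel / stride 4, then 4x4 kernel / stride 2.
    return conv_out(conv_out(size, 8, 4), 4, 2)

print(simple_encoder_out(20))  # -> 1: a 20px side just barely survives
print(simple_encoder_out(19))  # -> 0: no valid output window left, hence the error
```

Under those assumptions, 20 is exactly the smallest input that still produces a positive spatial output, which would explain why the limit isn't arbitrary.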
Do the hidden layers only consider the resulting image after all convolution and pooling steps, or do they also work with intermediate data? I have another use case where I'm training an agent to map a dungeon with different room sizes. The visual observation here is a top-down orthographic 32x32 b/w view of the mapped area. It starts out all black, and as the agent moves around the rooms, grid cells get filled with grey values representing their accessibility (number of neighbouring walls). The agent receives a discrete reward for each new cell it detects. So far I have been training it for 30M steps, and slowly but surely it gets more efficient overall. However, the agent still occasionally gets stuck in a looping pattern, although it should be able to see a nearby exit to still-unmapped cells. My suspicion is that the agent is somewhat aware of the overall layout, but fails to detect critical details like a 2px-wide gap representing a doorway. Are convolution and pooling doing more harm than good in this case?
Only the final encoded result. That sounds plausible, but I'm not at all sure whether thin features can get averaged out like that; I have too little experience with CNNs.
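One way to gauge whether the 2px doorway could plausibly get lost: run the same output-size arithmetic on the 32x32 observation. This assumes the same two-layer simple encoder (8x8 kernel stride 4, then 4x4 kernel stride 2, no padding), which I haven't verified against the source:

```python
def conv_out(size, kernel, stride):
    # Output size of an unpadded convolution along one dimension.
    return (size - kernel) // stride + 1

# Spatial side length after the (assumed) two conv layers of the simple encoder.
side = conv_out(conv_out(32, 8, 4), 4, 2)
print(side)  # -> 2: the 32x32 view is reduced to a 2x2 spatial map
```

If that's right, each of the four remaining spatial cells summarizes a large chunk of the map, and a 2px gap is a small fraction of any one window, so it's at least believable that the doorway signal gets diluted rather than the agent ignoring it outright.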