Question Atomic swap on more than one number?

Discussion in 'Shaders' started by JollyTheory, Mar 11, 2023.

  1. JollyTheory

    JollyTheory

    Joined:
    Dec 30, 2018
    Posts:
    153
    I am trying to implement an in-place falling sand algorithm. Let's say I have a RWTexture3D<uint> and I want every spot that is not zero to move one spot down if that spot is empty (0). That is easy to do:

    Let's say 'from' is xyz, 'to' is xyz + (0, -1, 0)
    Code (CSharp):
    uint toValueWas;
    InterlockedCompareExchange(_texture[to], 0, _texture[from], toValueWas);
    if (toValueWas == 0)
    {
        // the exchange succeeded, so toValueWas is 0 and this clears the source
        _texture[from] = toValueWas;
    }
    This works perfectly, but what if I need more storage than one uint? Let's say I want another texture B that gives me extra storage, and whose moves I want to keep in sync with the original texture. I have tried every which way to do this, but the two textures/buffers always go out of sync. For example, this doesn't work:
    Code (CSharp):
    uint toValueWas;
    InterlockedCompareExchange(_texture[to], 0, _texture[from], toValueWas);
    if (toValueWas == 0)
    {
        _texture[from] = toValueWas;
        // Goes out of sync with _texture:
        InterlockedExchange(_textureB[to], _textureB[from], _textureB[from]);
    }
    Is there a way to effectively do atomic swaps on more than 32 bits?
     
  2. Invertex

    Invertex

    Joined:
    Nov 7, 2013
    Posts:
    1,539
    RWStructuredBuffer<structContainingValues>
    Or RWTexture3D<uint4> for four values, for example.
     
    Last edited: Mar 12, 2023
  3. JollyTheory

    JollyTheory

    Joined:
    Dec 30, 2018
    Posts:
    153
    Yeah, but the atomic functions (Interlocked*) only work on a single uint/int.
     
  4. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    845
    I don't think so. You might need to get creative.

    One thing I can think of is a 2-pass solution. Store the source index (a single uint; you might need to bit-pack it if you have multiple coordinates) into a buffer using an interlocked operation. Now each destination index has exactly one source index. Then, on your second pass, read the source data and actually apply it.
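    A rough sketch of what that first pass could look like. Everything here is an assumption for illustration, not code from this thread: the _claims buffer (one uint per cell, cleared to NO_SOURCE before the dispatch), the _size constant, and the flatten() helper are all made up.
    Code (CSharp):
    // Pass 1 of a hypothetical two-pass version: occupied cells race to
    // *claim* the empty cell below them; nothing is moved yet.
    #define NO_SOURCE 0xFFFFFFFFu

    uint3 _size; // grid dimensions, set from the CPU (assumption)
    RWTexture3D<uint> _texture;
    RWStructuredBuffer<uint> _claims; // one slot per cell, cleared to NO_SOURCE each step

    uint flatten(uint3 p) { return (p.z * _size.y + p.y) * _size.x + p.x; }

    [numthreads(8, 8, 8)]
    void Claim(uint3 id : SV_DispatchThreadID)
    {
        // this pass only reads _texture, so those reads are race-free
        if (id.y == 0 || _texture[id] == 0)
            return;

        uint3 to = uint3(id.x, id.y - 1, id.z);
        if (_texture[to] != 0)
            return; // destination occupied this step

        uint prev;
        InterlockedCompareExchange(_claims[flatten(to)], NO_SOURCE, flatten(id), prev);
        // losing the race (prev != NO_SOURCE) just means this cell stays put
    }
    The second pass (a separate dispatch) would then read each claim slot and copy *all* per-cell data (_texture, _textureB, ...) from the single winning source, then clear the source. No atomics are needed there, because every destination has exactly one source.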

    Or pack up indices with a set of deltas into an append buffer, and on your second pass apply the operations in that append buffer.
     
  5. Invertex

    Invertex

    Joined:
    Nov 7, 2013
    Posts:
    1,539
    Oh sorry, right. I've actually done some falling pixel experiments before, and the approach I took was to double-buffer it, so you don't need to worry about interlocking at all: a read and a write buffer, where the read becomes the write once the step is done.
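    A minimal sketch of that double-buffered approach. The _read/_write names are made up; _write is assumed to be cleared to 0 before each dispatch, and the two textures are swapped on the CPU side after every step.
    Code (CSharp):
    RWTexture3D<uint> _read;  // state from the previous step
    RWTexture3D<uint> _write; // state being built this step

    [numthreads(8, 8, 8)]
    void Step(uint3 id : SV_DispatchThreadID)
    {
        uint cell = _read[id];
        if (cell == 0)
            return; // empty cells write nothing; _write starts cleared

        if (id.y > 0)
        {
            uint3 below = uint3(id.x, id.y - 1, id.z);
            if (_read[below] == 0)
            {
                _write[below] = cell; // only the cell directly above can land here
                return;
            }
        }
        _write[id] = cell; // blocked (or at the floor): stay in place
    }
    Because all decisions are made against the immutable read buffer, each destination cell has exactly one possible writer, so no atomics are needed.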
     
  6. JollyTheory

    JollyTheory

    Joined:
    Dec 30, 2018
    Posts:
    153
    Hmm, great idea with doing the atomics on indices instead of the actual values.
    I've tried implementing it (3 kernels) and, to my surprise, the two textures still get desynced. Something's fishy; time to try to debug this for the millionth time, I guess...
     
  7. b0nes123

    b0nes123

    Joined:
    Nov 6, 2019
    Posts:
    28
    You can use DeviceMemoryBarrierWithGroupSync() to synchronize threads within a single threadblock/workgroup. With some clever partitioning of your "sand buffer" (each threadblock gets a certain partition of the buffer to iterate), this should work.

    That being said, what granularity of synchronization do you actually need? Is it really necessary to synchronize every single swap, or can you do a little work, then synchronize, then repeat? As burningmime said, working in "2 passes" is much simpler, and much more performant, than synchronizing on every swap.
     
  8. JollyTheory

    JollyTheory

    Joined:
    Dec 30, 2018
    Posts:
    153
    By two passes do you mean ping-ponging textures, or something else?
    My problem is that I'm trying to create as large a sim as possible and I am using 2 GB buffers; with ping-ponging, the memory would balloon to double that, and I'm trying to avoid it. (For other reasons I also need it all in a single big buffer instead of, say, streaming the region closest to the camera from disk. I need infinite view range.)
    After working on this some more, I am now not even sure why the original single-texture atomic swap worked. There are definitely some GPU mechanics going on that complicate things. The idea with an append buffer of indices won't work, because the order in which the second kernel processes it will not be the same as the order in which the first kernel appended to it.
    What I don't understand is: after an atomic operation, do the other threads, including those in other threadgroups, see the change caused by the atomic? If so, why would two atomics on different textures ever go out of sync? Any way I imagine it being executed, in lockstep or in other groups the GPU didn't have enough cores to run in parallel, I just don't see why they would go out of sync. And yet they do.
     
  9. b0nes123

    b0nes123

    Joined:
    Nov 6, 2019
    Posts:
    28
    Hmm, it sounds like you are misunderstanding how the GPU functions. Only a warp within a threadgroup executes in lockstep. Beyond that, generally speaking, the GPU is free to schedule threadgroups on whichever SM it chooses, and within those threadgroups the order of execution of the warps is also not guaranteed.

    An atomic function does not guarantee an "order of execution" between threads, only that a consistent view of the data is maintained. A possible reason why your atomic is not working is that the GPU caches values from global memory to make them faster to access. Using an atomic function only causes the cache to flush within the current threadgroup. This can be remedied by prefixing the buffer with globallycoherent; however, without seeing the rest of the code, I cannot guarantee this will work. Keep in mind that this may carry a large performance penalty, as you are essentially bypassing the cache.
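    For reference, the qualifier goes on the UAV declaration itself, e.g.:
    Code (CSharp):
    // per the note above, without this qualifier writes are only
    // guaranteed to be visible within the current threadgroup
    globallycoherent RWTexture3D<uint> _texture;
    globallycoherent RWTexture3D<uint> _textureB;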
     
  10. JollyTheory

    JollyTheory

    Joined:
    Dec 30, 2018
    Posts:
    153
    The two textures have been marked globallycoherent this whole time; it still doesn't work, unfortunately.
     
  11. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    845
    Why, though? If this is for a game or film, you can almost certainly find ways to approximate things. One easy step would be to scale the simulation resolution up and down dynamically, so the area directly around the camera gets full res and further out gets quarter res or something. As the camera moves, you re-assign buckets, combining or splitting parcels of sand as needed.

    If you really need the whole simulation, can you at least chunk it so that you're only dealing with a small portion of the overall texture, then do a pass for the boundaries where the chunks meet?

    As for your question: if the other threads are also accessing it via atomic operations, and they're part of the same kernel/dispatch, they will see it. If you're sometimes accessing it via atomics and sometimes not, then stuff can break without careful flushes.
     
  12. b0nes123

    b0nes123

    Joined:
    Nov 6, 2019
    Posts:
    28
    To add onto what burningmime said, just because you are using globallycoherent and atomics does not necessarily guarantee coherency. I know that sounds weird, because that's basically the point of atomics, but using atomics on data accessed by multiple threadblocks/groups is known to *sometimes* fail to maintain coherency (i.e., fail the message-passing atomic litmus test). The more pressure you put on the memory transfers, the more likely you are to break coherency.

    Unfortunately, there isn't a way around this except by specifically designing your code with this in mind.
     
  13. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    845
    I didn't mean implementation issues -- I was specifically referring to the fact that GPUs have an "atomic unit" which keeps coherency only when you access memory through it. What this means is that if you have...

    Code (CSharp):
    uint x, y;
    InterlockedAdd(myBuffer[i], 100u, x); // access via atomic unit
    y = myBuffer[i]; // access via normal memory
    if (x + 100u != y)
    {
        // even if *no other thread* touched it, this could still be true, because the
        // atomic unit's cache may not have been flushed to the regular cache or main memory
    }
     
    b0nes123 likes this.
  14. b0nes123

    b0nes123

    Joined:
    Nov 6, 2019
    Posts:
    28
    Ah, I see. What I was talking about is that even an atomic load like this can fail:

    Code (CSharp):
    uint x;
    groupshared uint y[GROUP_SIZE]; // use shared memory because you cannot use local memory as an atomic destination

    InterlockedAdd(myBuffer[i], 100u, x);
    y[group_thread_id] = 0;
    InterlockedOr(y[group_thread_id], myBuffer[i]); // "atomic load" of myBuffer[i] into shared memory

    if (x + 100u != y[group_thread_id])
    {
        // this can still happen in certain situations
    }
     
    Last edited: Mar 15, 2023
  15. JollyTheory

    JollyTheory

    Joined:
    Dec 30, 2018
    Posts:
    153
    Okay, so after about a week of bashing my head against this, I think I finally fixed it by having a bit in texture A that says "has this one been moved yet?", reset in a separate kernel at the start of every update. Basically, this restricts the atomic moves to one move per kernel execution, removing the possibility of the following construct occurring, which complicates things massively:
    (imagine we're swapping everything one index to the right if it's empty)
    thread 0 happens to execute first and swaps [0] and [1]; then thread 1 happens to execute next, picks up right after, and sees that it can already do yet another swap from [1] to [2], etc.
    Final working code is:
    Code (CSharp):
    uint _textureFrom = _texture[from];

    if (bit(_textureFrom, BEEN_MOVED_BIT) == 1) return;

    _textureFrom = setbit(_textureFrom, BEEN_MOVED_BIT, 1);

    uint toValueWas;
    InterlockedCompareExchange(_texture[to], 0, _textureFrom, toValueWas);
    if (toValueWas == 0)
    {
        _texture[from] = 0;

        uint __;
        InterlockedExchange(_textureB[to], _textureB[from], __);
    }
    My guess at the source of the desync is that, while the Interlocked* lines themselves are indeed perfectly atomic, everything around them is not... I don't know; the more I think about it, the less sense it makes, but this works. At the very least, I can say that one of the keys to the desync was definitely the situation where more than one swap happened in quick succession.
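    (For completeness, the bit() and setbit() helpers aren't shown in the thread; presumably they are small functions along these lines:)
    Code (CSharp):
    // hypothetical helpers matching how they are used above
    uint bit(uint value, uint index)
    {
        return (value >> index) & 1u;
    }

    uint setbit(uint value, uint index, uint b)
    {
        return (value & ~(1u << index)) | (b << index);
    }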
     
  16. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    845
    That seems like an implementation bug (hardware/driver/etc.). If the spec allowed that even when no other threads are accessing the same variable, then atomic operations would be essentially useless (and stuff like tile-based light culling wouldn't work).
     
  17. b0nes123

    b0nes123

    Joined:
    Nov 6, 2019
    Posts:
    28
    Using kernel launches like this makes them, in effect, device-wide memory barriers. While this will work, and I am in no way trying to put you down, it is extremely inefficient. The reason your atomics are not working is that you have multiple different threadgroups reading and writing the same global data.

    As I mentioned previously, the solution is to partition your data so that each threadgroup is guaranteed to only access a disjoint partition of the buffer, like so:

    Code (CSharp):
    #define GROUP_SIZE 64
    #define THREAD_BLOCKS 64

    RWStructuredBuffer<uint> myBuffer;

    // set this on the CPU side (bufferLength / THREAD_BLOCKS); it should be a
    // multiple of GROUP_SIZE so every thread reaches the barrier together
    extern uint partitionSize;

    [numthreads(GROUP_SIZE, 1, 1)]
    void kernel(int3 gtid : SV_GroupThreadID, int3 gid : SV_GroupID)
    {
        int partitionStart = gid.x * partitionSize;
        int partitionEnd = (gid.x + 1) * partitionSize;

        for (int i = gtid.x + partitionStart; i < partitionEnd; i += GROUP_SIZE)
        {
            // your code here
            // atomics should not be needed
            myBuffer[i];
            DeviceMemoryBarrierWithGroupSync();
        }
    }
    Barring this, if your sand-falling algorithm is as simple as you say (decrementing a single value in a buffer), another surefire way to keep the buffers synchronized is to run the algorithm on the second buffer one step behind the first. I guarantee you, either of these will be faster than dispatching a kernel for every iteration of your algorithm.
     
    Last edited: Mar 16, 2023