I have a compute shader in Vulkan in which I launch N threads with eight sub-threads each. Each subgroup of 8 threads shares an eight-element 'location' array. Each sub-thread calculates a number from 1 to 8, and there can be duplicates. The sub-threads do not need to be ordered.

A sub-thread only needs to do work if it has calculated a number which is not a duplicate. So, each thread must read the location array to see if its number has already been processed. If its number is not in the array then it writes its number to the array and does work. Another thread comes along with the same number, reads the array, and sees that its number has been processed, and so does nothing.

I need the 'location' array to be shared and thread safe.

I tried to use subgroups to do this, but it is not clear how the threads in a subgroup share writable data. Can I use broadcast? If so, how? I've tried atomics and memory barriers without success.
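For reference, the "read the array, then write if absent" step described above is two separate operations, so two threads holding the same number can both read "absent" before either one writes. A minimal CPU-side sketch of one way to close that gap, assuming the eight-element array can be replaced by an 8-bit mask, is to fold the check and the write into a single atomic OR that returns the previous value. The names `try_claim` and `count_workers` are hypothetical and are not part of any Vulkan API. In GLSL the analogous call would be `atomicOr` on a `shared uint`. This is a sketch under those assumptions, not a tested shader.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns true only for the first caller that claims `number`.
// Bit (number - 1) of `claimed` is set once that number has been taken.
bool try_claim(std::atomic<std::uint32_t>& claimed, unsigned number) {
    assert(number >= 1 && number <= 8);
    const std::uint32_t bit = 1u << (number - 1);
    // fetch_or sets the bit and returns the previous mask in one atomic step,
    // so there is no window between "check" and "write".
    const std::uint32_t previous = claimed.fetch_or(bit, std::memory_order_relaxed);
    return (previous & bit) == 0;
}

// Hypothetical driver: counts how many sub-threads would do work for one group.
// The calls run sequentially here; each call stands in for one sub-thread.
int count_workers(const std::vector<unsigned>& numbers) {
    std::atomic<std::uint32_t> claimed{0};
    int workers = 0;
    for (unsigned n : numbers) {
        if (try_claim(claimed, n)) {
            ++workers;  // stand-in for the real per-number work
        }
    }
    return workers;
}
```

With the inputs `{1, 1, 2, 3, 3, 3, 8, 8}`, only the first holder of each of the four distinct numbers gets `true` from `try_claim`. The mask has to be zeroed before any sub-thread tries to claim a number. In a shader that would mean one invocation clearing it, followed by a barrier.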

  • This algorithm seems pretty sequential and not well suited to GPUs. Besides, it looks like it causes race conditions: threads can write to the shared array while others are reading it. If you avoid that with synchronisation, things will get much slower (even more sequential, with most threads wasting time). If your goal is to remove duplicates, a common approach is to sort the data and then remove contiguous duplicates with atomic accesses. The sort is the expensive part on modern GPUs. Commented Mar 11 at 23:51
  • By the way, the SIMD units of GPUs generally have at least 16 lanes. For example, on Nvidia GPUs a warp has 32 threads. On AMD, a wavefront is AFAIK 64 threads. Thus, with 8 threads per subgroup the GPU will be clearly underused! I advise you to use larger subgroups, if you can, to make the approach more efficient. Commented Mar 12 at 0:00
  • Thanks for your response. I am doing it in a loop now, where one thread services all eight 'buckets'. It's pretty fast now, and I think doing it the way I originally described might even be slower. I'm convinced. If I could do it, I would use 32 sub-threads to service four arrays. Commented Mar 12 at 0:59
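
The sort-then-drop-adjacent-duplicates approach mentioned in the first comment can be illustrated on the CPU as follows. This is only a sequential sketch of the idea, assuming the numbers are available as a plain array. A GPU version would use a parallel sort and a parallel compaction step instead. The function name `unique_sorted` is hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: returns the distinct values of `values` in ascending order.
std::vector<unsigned> unique_sorted(std::vector<unsigned> values) {
    // Sorting places equal values next to each other.
    std::sort(values.begin(), values.end());
    // std::unique moves the first element of each run to the front and returns
    // the new logical end; erase then drops the leftover tail.
    values.erase(std::unique(values.begin(), values.end()), values.end());
    return values;
}
```

After this step each remaining value appears exactly once, so every value can be handed to exactly one worker without any shared 'location' array.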
