Having an issue with __shfl functions in HIP with ROCm #5478
I'm trying to write a HIP kernel that uses the __shfl functions to do a warp-level sum reduction, but the shuffle step doesn't seem to do anything.

```cpp
__global__ void sum_kernel(int n, float const *__restrict__ x, int incx, float *__restrict__ out) {
    int thread_id, num_threads;
    thread_id = threadIdx.x + blockDim.x * blockIdx.x;
    num_threads = blockDim.x * gridDim.x;
    int inwarp_id = thread_id % warpSize;
    int warp_id = thread_id / warpSize;
    int num_warps = num_threads / warpSize;
    int parallel_elems = n / num_threads;
    int remaining_elems = n % num_threads;
    float sum = 0.0f;
    float const *x_curr = x + (ptrdiff_t)warp_id * warpSize * parallel_elems * incx;
    for(int i = inwarp_id; i < parallel_elems * warpSize; i += incx * warpSize) {
        sum += x_curr[i];
    }
    if(thread_id < remaining_elems) {
        sum += x[thread_id + num_warps * warpSize * parallel_elems * incx];
    }
    // A print statement shows that all of the accumulators hold the appropriate values up to this point.
#pragma unroll
    for(uint32_t i = warpSize / 2; i >= 1; i /= 2) {
        // Printing what I get from the shuffle operation always gives zero.
        sum = sum + __shfl_down(sum, i);
    }
    __syncthreads();
    // At this point, the accumulators haven't changed, as if the reduction step never happened.
    if(inwarp_id == 0) {
        atomicAdd(out, sum);
    }
}
```

I'm using an AMD Radeon RX 7900 XTX on Debian 12 with ROCm 6.4.2 from the AMD Apt repo.
Replies: 2 comments
Hi @cgbriggs99, I tried the code snippet you provided, and it seems to be working fine. Here is the host code I used to launch it; perhaps something went wrong in your memory allocation and copying?
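(The host code the responder posted is not preserved in this extract. The snippet below is a minimal, hypothetical sketch of what such a launch could look like; the sizes, variable names, and the `incx` value are all assumptions, not taken from the thread.)

```cpp
// Hypothetical host-side launcher for the sum_kernel from the question above.
// NOT the responder's original code; all values here are illustrative.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> h_x(n, 1.0f);   // all ones, so the expected sum is n

    float *d_x = nullptr, *d_out = nullptr;
    hipMalloc(&d_x, n * sizeof(float));
    hipMalloc(&d_out, sizeof(float));
    hipMemcpy(d_x, h_x.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemset(d_out, 0, sizeof(float));   // kernel accumulates into *out with atomicAdd

    // Grid dimension first, block dimension second.
    dim3 grid(256), block(256);
    sum_kernel<<<grid, block>>>(n, d_x, /*incx=*/1, d_out);
    hipDeviceSynchronize();

    float h_out = 0.0f;
    hipMemcpy(&h_out, d_out, sizeof(float), hipMemcpyDeviceToHost);
    printf("sum = %f (expected %d)\n", h_out, n);

    hipFree(d_x);
    hipFree(d_out);
    return 0;
}
```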
I looked at your response and immediately knew what was wrong. I got the launch parameters in the wrong order. Thanks for the reply.
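(For readers hitting the same symptom: in HIP, as in CUDA, the first launch parameter is the grid dimension and the second is the block dimension, so swapping them still compiles but requests a different configuration than intended. Below is a hedged illustration of that kind of mix-up, reusing the hypothetical names from the host sketch above; the author's actual values are not given in the thread.)

```cpp
// Illustrative values only; the author's real launch configuration isn't shown.
dim3 grid(4096);   // number of blocks
dim3 block(256);   // threads per block

// Intended order: grid first, then block.
sum_kernel<<<grid, block>>>(n, d_x, 1, d_out);

// Accidentally swapped: this compiles, but requests 256 blocks of 4096 threads,
// which exceeds the usual 1024 threads-per-block limit, so the launch fails.
// Without checking hipGetLastError(), that failure is silent and *out never changes.
sum_kernel<<<block, grid>>>(n, d_x, 1, d_out);
```

Checking hipGetLastError() after each launch catches this kind of mix-up immediately.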