Having an issue with __shfl functions in HIP with ROCm #5478
I'm trying to write a HIP kernel that uses the __shfl functions to do a warp-level sum reduction, but the shuffle step doesn't seem to do anything.

```cpp
__global__ void sum_kernel(int n, float const *__restrict__ x, int incx, float *__restrict__ out) {
    int thread_id, num_threads;
    thread_id = threadIdx.x + blockDim.x * blockIdx.x;
    num_threads = blockDim.x * gridDim.x;
    int inwarp_id = thread_id % warpSize;
    int warp_id = thread_id / warpSize;
    int num_warps = num_threads / warpSize;
    int parallel_elems = n / num_threads;
    int remaining_elems = n % num_threads;
    float sum = 0.0f;
    float const *x_curr = x + (ptrdiff_t)warp_id * warpSize * parallel_elems * incx;
    for(int i = inwarp_id; i < parallel_elems * warpSize; i += incx * warpSize) {
        sum += x_curr[i];
    }
    if(thread_id < remaining_elems) {
        sum += x[thread_id + num_warps * warpSize * parallel_elems * incx];
    }
    // A print statement shows that all of the accumulators hold the appropriate values up to this point.
#pragma unroll
    for(uint32_t i = warpSize / 2; i >= 1; i /= 2) {
        // Printing what I get from the shuffle operation always gives zero.
        sum = sum + __shfl_down(sum, i);
    }
    __syncthreads();
    // At this point, the accumulators haven't changed, as if the reduction step never happened.
    if(inwarp_id == 0) {
        atomicAdd(out, sum);
    }
}
```

I'm using an AMD Radeon RX 7900 XTX on Debian 12 with ROCm 6.4.2 from the AMD Apt repo.
Replies: 2 comments
Hi @cgbriggs99, I tried the code snippet you provided, and it seems to be working fine. Here is the host code I used to launch it; perhaps something went wrong in your memory allocation and copying?
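(The host code the responder posted is not preserved in this extract. The snippet below is a minimal, hypothetical sketch of what such a launch could look like; the sizes, variable names, and the `incx` value are all assumptions, not taken from the thread.)

```cpp
// Hypothetical host-side launcher for the sum_kernel from the question above.
// NOT the responder's original code; all values here are illustrative.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> h_x(n, 1.0f);   // all ones, so the expected sum is n

    float *d_x = nullptr, *d_out = nullptr;
    hipMalloc(&d_x, n * sizeof(float));
    hipMalloc(&d_out, sizeof(float));
    hipMemcpy(d_x, h_x.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemset(d_out, 0, sizeof(float));   // kernel accumulates into *out with atomicAdd

    // Grid dimension first, block dimension second.
    dim3 grid(256), block(256);
    sum_kernel<<<grid, block>>>(n, d_x, /*incx=*/1, d_out);
    hipDeviceSynchronize();

    float h_out = 0.0f;
    hipMemcpy(&h_out, d_out, sizeof(float), hipMemcpyDeviceToHost);
    printf("sum = %f (expected %d)\n", h_out, n);

    hipFree(d_x);
    hipFree(d_out);
    return 0;
}
```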
I looked at your response and immediately knew what was wrong. I got the launch parameters in the wrong order. Thanks for the reply.
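(For readers hitting the same symptom: in HIP, as in CUDA, the first launch parameter is the grid dimension and the second is the block dimension, so swapping them still compiles but requests a different configuration than intended. Below is a hedged illustration of that kind of mix-up, reusing the hypothetical names from the host sketch above; the author's actual values are not given in the thread.)

```cpp
// Illustrative values only; the author's real launch configuration isn't shown.
dim3 grid(4096);   // number of blocks
dim3 block(256);   // threads per block

// Intended order: grid first, then block.
sum_kernel<<<grid, block>>>(n, d_x, 1, d_out);

// Accidentally swapped: this compiles, but requests 256 blocks of 4096 threads,
// which exceeds the usual 1024 threads-per-block limit, so the launch fails.
// Without checking hipGetLastError(), that failure is silent and *out never changes.
sum_kernel<<<block, grid>>>(n, d_x, 1, d_out);
```

Checking hipGetLastError() after each launch catches this kind of mix-up immediately.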