
I need a function like cudaDeviceSynchronize() to wait for a kernel to finish execution. However, since CUDA 11.6 it is no longer possible to use any form of synchronization inside device functions. How can I synchronize child kernels launched from a device function?

Here's the code that I'm trying to run:

__global__ void NNFeedForwardNormalMultiple(double* __restrict__ values, double* __restrict__ weights, double* result, int inputsize, int outputsize) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // weights is laid out row-major as [outputsize][inputsize]
    int outputidx = idx / inputsize;
    int inputidx = idx % inputsize;

    if (outputidx >= outputsize || inputidx >= inputsize) {
        return;
    }

    atomicAdd(&result[outputidx], values[inputidx] * weights[inputsize*outputidx + inputidx]);
}

__global__ void NNFeedForwardNormalActivate(double* __restrict__ biases, double* result, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx >= size) {
        return;
    }

    result[idx] = 1.0 / (1.0 + exp(-(result[idx] + biases[idx])));
}

__global__ void NNFeedForwardNormal(double* __restrict__ values, double* __restrict__ weights, double* result, double* __restrict__ biases, int inputsize, int outputsize) {
    int blocksize = (inputsize * outputsize + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK;
    NNFeedForwardNormalMultiple<<<blocksize, THREADS_PER_BLOCK>>>(values, weights, result, inputsize, outputsize);
    // here I would normally call a cudaDeviceSynchronize()-style function to wait for the child kernel to finish
    NNFeedForwardNormalActivate<<<(outputsize + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(biases, result, outputsize);
}

Thanks!

  • "Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6 and removed for compute_90+ compilation. For compute capability < 9.0, compile-time opt-in by specifying -DCUDA_FORCE_CDP1_IF_SUPPORTED is required to continue using cudaDeviceSynchronize() in device code. Note that this is slated for full removal in a future CUDA release." — CUDA C++ Programming Guide. Commented Mar 9 at 15:12
  • As you may find, this particular methodology is deprecated entirely and you should rethink your code: forums.developer.nvidia.com/t/… You should not synchronize on child-kernel completion in a parent kernel. If you need to consume results from the child kernel in device code, launch a new kernel. Yes, this will require refactoring your code. There is no zero-impact workaround that I am aware of. Commented Mar 9 at 15:14
  • stackoverflow.com/questions/79432271/… Commented Mar 9 at 18:55
  • That's not the reason it works in my example. In my example, which uses two kernel launches from device code, the 2nd kernel launch is guaranteed not to start until the first kernel launch is finished. This is guaranteed by CUDA stream semantics: items issued into the same stream are serialized, in issue order. Furthermore, changes made to the global space by one kernel are guaranteed to be visible to the next, by the kernel launch boundary. I actually state all this in example 3 of my answer. Commented Mar 10 at 18:41
  • What you have depicted in your question (without the cudaDeviceSynchronize()) is exactly what I am suggesting in my example 3. The cudaDeviceSynchronize() is unnecessary. The instance of NNFeedForwardNormalActivate will not begin executing until NNFeedForwardNormalMultiple is finished. CUDA stream semantics guarantee this, even in device code. It actually works the same way as it works in host code, due to stream semantics. And the changes made by the first launch to the global space will be visible to the second launch. Commented Mar 10 at 18:44
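The comments (in particular the "example 3" approach) suggest dropping the synchronization call entirely and relying on device-side stream semantics. Below is a minimal sketch of that pattern. The kernel names (accumulate, activate, feedForward) are simplified stand-ins for the question's kernels, and the -arch=sm_70 flag is an assumption (double-precision atomicAdd requires compute capability 6.0 or newer).

```cuda
// Sketch: the parent kernel launches both children into its default
// device-side stream and does NOT call cudaDeviceSynchronize().
// CDP stream semantics guarantee that activate does not begin until
// accumulate has finished, and that accumulate's writes to global
// memory are visible to activate.
// Compile with relocatable device code, e.g.:
//   nvcc -rdc=true -arch=sm_70 sketch.cu -o sketch

#define THREADS_PER_BLOCK 256

__global__ void accumulate(const double* values, const double* weights,
                           double* result, int inputsize, int outputsize) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= inputsize * outputsize) return;
    int outputidx = idx / inputsize;   // weights is [outputsize][inputsize]
    int inputidx  = idx % inputsize;
    atomicAdd(&result[outputidx],
              values[inputidx] * weights[outputidx * inputsize + inputidx]);
}

__global__ void activate(const double* biases, double* result, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= size) return;
    result[idx] = 1.0 / (1.0 + exp(-(result[idx] + biases[idx])));  // sigmoid
}

// Launch this parent with <<<1,1>>> so each child is launched exactly once;
// every parent thread that executes <<<...>>> would launch its own children.
__global__ void feedForward(const double* values, const double* weights,
                            double* result, const double* biases,
                            int inputsize, int outputsize) {
    int nblocks = (inputsize * outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    accumulate<<<nblocks, THREADS_PER_BLOCK>>>(values, weights, result,
                                               inputsize, outputsize);
    // No synchronization needed: both launches go into the same
    // device-side stream, so they execute in issue order.
    activate<<<(outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK,
               THREADS_PER_BLOCK>>>(biases, result, outputsize);
}
```

Note this ordering guarantee holds only because both children are issued into the same stream; launching the second child into a separate stream (e.g. cudaStreamFireAndForget) would break it.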
