I need a function like cudaDeviceSynchronize() to wait for a kernel to finish executing. However, since CUDA 11.6 it is no longer possible to use any form of synchronization within device functions. How can I synchronize child kernels launched from a device function?
Here's the code that I'm trying to run:
__global__ void NNFeedForwardNormalMultiple(double* __restrict__ values, double* __restrict__ weigths, double* result, int inputsize, int outputsize) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int outputidx = idx / outputsize;
    int inputidx = idx % outputsize;
    if (outputidx >= outputsize || inputidx >= inputsize) {
        return;
    }
    atomicAdd(&result[outputidx], values[inputidx] * weigths[outputsize*outputidx + inputidx]);
}

__global__ void NNFeedForwardNormalActivate(double* __restrict__ biases, double* result, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= size) {
        return;
    }
    result[idx] = 1.0 / (1.0 + exp(-(result[idx] + biases[idx])));
}

__global__ void NNFeedForwardNormal(double* __restrict__ values, double* __restrict__ weigths, double* result, double* __restrict__ biases, int inputsize, int outputsize) {
    int blocksize = (inputsize * outputsize + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK;
    NNFeedForwardNormalMultiple<<<blocksize, THREADS_PER_BLOCK>>>(values, weigths, result, inputsize, outputsize);
    // normally I would call a cudaDeviceSynchronize()-style function here to wait for the child kernel to finish
    NNFeedForwardNormalActivate<<<(outputsize + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(biases, result, outputsize);
}
Thanks!
The CUDA C++ Programming Guide states: "cudaDeviceSynchronize() in device code is deprecated in CUDA 11.6 and removed for compute_90+ compilation. For compute capability < 9.0, compile-time opt-in by specifying -DCUDA_FORCE_CDP1_IF_SUPPORTED is required to continue using cudaDeviceSynchronize() in device code. Note that this is slated for full removal in a future CUDA release."

In your case, however, the cudaDeviceSynchronize() is unnecessary. The instance of NNFeedForwardNormalActivate will not begin executing until NNFeedForwardNormalMultiple is finished: both kernels are launched by the same thread into the same stream, and CUDA stream semantics guarantee in-order execution, even in device code. It works the same way as it does in host code. The changes made by the first launch to global memory will be visible to the second launch.
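To illustrate the same-stream ordering guarantee in isolation, here is a minimal, hypothetical sketch (kernel and variable names are my own, not from your code). It assumes a CDP-capable GPU and compilation with relocatable device code, e.g. nvcc -rdc=true. The parent launches two child kernels back to back with no device-side synchronization; the second child still observes the first child's writes:

```cuda
// Minimal sketch of device-side launch ordering (illustrative names).
// Build with: nvcc -rdc=true example.cu
#include <cstdio>

__global__ void childWrite(int* buf) {
    buf[threadIdx.x] = threadIdx.x + 1;   // write 1..32
}

__global__ void childRead(const int* buf, int* sum) {
    // Safe to read: this kernel was launched into the same stream, by the
    // same parent thread, as childWrite, so it starts only after childWrite
    // completes, and childWrite's global-memory writes are visible here.
    atomicAdd(sum, buf[threadIdx.x]);
}

__global__ void parent(int* buf, int* sum) {
    // Both launches go to the same stream from the same thread, so they
    // execute in order -- no device-side cudaDeviceSynchronize() needed.
    childWrite<<<1, 32>>>(buf);
    childRead<<<1, 32>>>(buf, sum);
}

int main() {
    int *buf, *sum;
    cudaMalloc(&buf, 32 * sizeof(int));
    cudaMalloc(&sum, sizeof(int));
    cudaMemset(sum, 0, sizeof(int));
    parent<<<1, 1>>>(buf, sum);
    cudaDeviceSynchronize();   // host-side synchronization is still allowed
    int h = 0;
    cudaMemcpy(&h, sum, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h);   // 1 + 2 + ... + 32 = 528 if ordering held
    cudaFree(buf);
    cudaFree(sum);
    return 0;
}
```

The same reasoning applies to your NNFeedForwardNormal: the activation kernel launch is ordered after the multiply/accumulate kernel automatically.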