
Is there an officially sanctioned way to reuse shared memory data between __global__ functions?

Consider the following code https://cuda.godbolt.org/z/KMj9EKKbf:

#include <cuda.h>
#include <stdio.h>

__global__ void kernel_A(int** s) {
  __shared__ int sdata;
  sdata = clock();   // write a block-local value into shared memory
  *s = &sdata;       // publish the shared-memory address via global memory
  printf("sdata: %p = %i\n", &sdata, sdata);
}

//somewhat surprisingly, inserting a device function with __shared__ data
//does not move kernel_B's shared data. 
//Maybe this is just luck?
__device__ int& make_shared() {
  __shared__ int bdata;
  bdata = 1;
  auto& result = bdata;
  return result;
}

__global__ void kernel_B(int** ptr_sdata) {
  auto& bdata = make_shared();
  printf("bdata: %p = %i\n", &bdata, bdata);
  __shared__ int sdata;
  // *ptr_sdata[0] parses as **ptr_sdata: read through the address kernel_A left behind
  printf("A.sdata: %p = %i, B.sdata: %p = %i\n", *ptr_sdata, *ptr_sdata[0], &sdata, sdata);
}

int main() {
  int** ptr_sdata;
  cudaMalloc(&ptr_sdata, sizeof(int*));
  kernel_A<<<1,1>>>(ptr_sdata);
  kernel_B<<<1,1>>>(ptr_sdata);
  cudaDeviceSynchronize();
}

This prints:

sdata: 0x7d52c5000000 = -594894858
bdata: 0x7d52c5000004 = 1
A.sdata: 0x7d52c5000000 = -594894858, B.sdata: 0x7d52c5000000 = -594894858

I understand that kernel_B might not get assigned to the same SM as kernel_A, but let's assume that I spawn as many kernels as there are SMs, so that every SM runs a single block.
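One way to check where blocks actually land is to read the %smid PTX special register, which holds the id of the SM executing the current thread. A minimal sketch (the helper name current_sm is mine, and the register's value is documented as potentially changing during execution, so treat it purely as a diagnostic):

__device__ unsigned current_sm() {
  unsigned id;
  asm("mov.u32 %0, %%smid;" : "=r"(id)); // PTX special register: id of the executing SM
  return id;
}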

Is there a way to reuse the shared memory contents from kernel_A in kernel_B reliably?

  • Having thought about it a little, from an optimization point of view, the overhead between two __global__ functions is about 25 microseconds (if I remember correctly), so I guess that leaves plenty of time to shuttle the shared buffers to global memory in A and back again in B (see the sketch after these comments); but still, curious. Commented Jun 3 at 9:52
  • Completely undefined behaviour. The one-word answer is no. Commented Jun 3 at 9:56
  • In between your kernels, the driver could schedule another workload from another stream, or even from another user/application depending on context. That other kernel could overwrite the data in shared memory. Commented Jun 3 at 12:45
  • Shared memory (SHM) lives on the SM. Even if reusing it were possible, it would only be beneficial in very restricted cases, and the performance improvement would be small anyway, because SHM data can be stored in the pretty-fast L2 cache and possibly reloaded from the cache in the second kernel (not guaranteed, though). Even a reload from VRAM is not that expensive considering the tiny amount of SHM: for example, an Nv 1660S has only 1.4 MiB of SHM in total, so it should take about 4 µs to write/read that data to/from VRAM (1.4 MiB at the card's ~336 GB/s), and certainly under 2 µs for the L2 cache. Starting a kernel takes more than that. Commented Jun 3 at 21:26
  • Also, if you have far more blocks than can run concurrently on the available SMs, the blocks of the first kernel would overwrite data of other blocks of the same kernel, and since the order of execution of blocks is unspecified/undocumented, reusing SHM makes no sense in this case. It only makes sense if you can guarantee that all blocks run concurrently, that the second kernel executes just after, and that it gets the same mapping (so with fewer concurrent blocks than the maximum possible). Commented Jun 3 at 21:30
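A minimal sketch of the global-memory round trip the first comment describes (the names producer, consumer, stage, and N are illustrative, not from the question): the first kernel spills its shared buffer to a global staging buffer before exiting, and the second kernel reloads it. Because kernels launched on the same stream execute in order, this works regardless of which SM each block lands on.

#include <cuda.h>
#include <stdio.h>

constexpr int N = 256;

__global__ void producer(int* stage) {
  __shared__ int sdata[N];
  sdata[threadIdx.x] = threadIdx.x * 2;    // build the block's shared data
  __syncthreads();
  stage[threadIdx.x] = sdata[threadIdx.x]; // spill to global memory before exit
}

__global__ void consumer(const int* stage) {
  __shared__ int sdata[N];
  sdata[threadIdx.x] = stage[threadIdx.x]; // reload from global memory
  __syncthreads();
  if (threadIdx.x == 0) printf("sdata[5] = %i\n", sdata[5]);
}

int main() {
  int* stage;
  cudaMalloc(&stage, N * sizeof(int));
  producer<<<1, N>>>(stage);  // same stream, so consumer sees producer's writes
  consumer<<<1, N>>>(stage);
  cudaDeviceSynchronize();
  cudaFree(stage);
}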
