Conversation
Outstanding issues:
The `Stream` class does not have a `_handle` data member.
This is necessary to avoid a circular dependency: cluster-related occupancy functions need `LaunchConfig`, occupancy functions are defined in `_module.py`, and `_launcher.py`, which used to house the definition of `LaunchConfig`, imports `Kernel` from `_module.py`.
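One common way to break such an import cycle, sketched below with hypothetical stand-in modules `mod_a`/`mod_b` (not the repository's actual solution, which moved the occupancy code into `_module.py`), is to defer one side of the cycle to call time:

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# mod_b needs a class from mod_a at import time, while mod_a only needs
# mod_b inside a function body -- so mod_a defers its import to call time.
tmp = Path(tempfile.mkdtemp())
(tmp / "mod_a.py").write_text(textwrap.dedent("""
    class A:  # analogous to LaunchConfig living in one module
        pass

    def use_b():
        from mod_b import B  # deferred import breaks the cycle
        return B()
"""))
(tmp / "mod_b.py").write_text(textwrap.dedent("""
    from mod_a import A  # safe: mod_a has no top-level import of mod_b

    class B(A):  # analogous to Kernel depending on LaunchConfig
        pass
"""))
sys.path.insert(0, str(tmp))

import mod_a

print(type(mod_a.use_b()).__name__)
```

The deferred import resolves only once `mod_a` is fully initialized, so neither module sees a half-imported partner.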
This class defines kernel occupancy query methods:

- `max_active_blocks_per_multiprocessor`
- `max_potential_block_size`
- `available_dynamic_shared_memory_per_block`
- `max_potential_cluster_size`
- `max_active_clusters`

The implementation is based on the driver API. The following occupancy-related driver functions are not used:

- `cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags`
- `cuOccupancyMaxPotentialBlockSizeWithFlags`

In `cuOccupancyMaxPotentialBlockSize`, only a constant dynamic shared-memory size is supported for now. Supporting a variable dynamic shared-memory size that depends on the block size is deferred until the design is resolved.
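For intuition, a query of the `max_active_blocks_per_multiprocessor` kind boils down to taking the tightest of several per-SM resource limits. The following is a toy sketch with made-up device limits; the real implementation defers entirely to the driver API:

```python
# Illustrative occupancy arithmetic. All limits are invented defaults,
# not real device values; real queries go through the CUDA driver.
def max_active_blocks_per_sm(
    threads_per_block,
    smem_per_block,
    max_threads_per_sm=2048,
    max_blocks_per_sm=32,
    smem_per_sm=49152,
):
    """Blocks per SM are bounded by thread, block-slot, and shared-memory limits."""
    if threads_per_block <= 0:
        return 0
    by_threads = max_threads_per_sm // threads_per_block
    by_smem = smem_per_sm // smem_per_block if smem_per_block > 0 else max_blocks_per_sm
    return min(by_threads, by_smem, max_blocks_per_sm)

print(max_active_blocks_per_sm(256, 8192))  # shared memory is the binding limit here
```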
/ok to test
Use it as the return type for the `KernelOccupancy.max_potential_block_size` output.
The `cuda_utils.driver.CUoccupancyB2DSize` type is supported. The required size of the dynamic shared memory allocation was renamed to `dynamic_shared_memory_needed`.
The test requires Numba. If Numba is absent, the test is skipped; otherwise `numba.cfunc` is used to compile the Python function. The `ctypes.CFuncPtr` object obtained from `cfunc_res.ctypes` is converted to `CUoccupancyB2DSize`.
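For illustration, the same kind of C-callable `size_t (*)(int)` pointer can be produced with a plain `ctypes` callback, no Numba needed. This is a sketch, not the PR's test code, and note that such a callback re-enters Python and therefore needs the GIL, which matters for the deadlock concern raised later in this thread:

```python
import ctypes

# A size_t (*)(int) callback, the shape CUoccupancyB2DSize expects.
B2DSIZE = ctypes.CFUNCTYPE(ctypes.c_size_t, ctypes.c_int)

@B2DSIZE
def dynamic_shared_memory_needed(block_size):
    # 1 KiB of dynamic shared memory per full warp of 32 threads
    return 0 if block_size <= 32 else ((block_size - 1) // 32) * 1024

# The raw address is what a CUoccupancyB2DSize(_ptr=...) conversion would use.
fn_ptr = ctypes.cast(dynamic_shared_memory_needed, ctypes.c_void_p).value
print(hex(fn_ptr))
```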
In case we do not want to depend on Numba, the callback can come from a small C library:

```c
// gcc -shared -fPIC b2dsize.c -o b2dsize.so
#include <stddef.h>

size_t dynamic_shared_memory_needed(int blockSize) {
    return (blockSize <= 32) ? (size_t)0
                             : (size_t)((blockSize - 1) / 32) * ((size_t)1024);
}
```

Then:

```python
import ctypes

from cuda.core.experimental._utils.cuda_utils import driver

lib = ctypes.cdll.LoadLibrary("./b2dsize.so")
cfunc = lib.dynamic_shared_memory_needed
fn_ptr = ctypes.cast(cfunc, ctypes.c_void_p).value
dynamic_smem_needed_fn = driver.CUoccupancyB2DSize(_ptr=fn_ptr)
```

This would require a compiler being available at test time, which is easy to arrange in conda, at least. We could build a fixture that builds such a library and skips the test if the build step fails due to an absent compiler.
/ok to test |
/ok to test 436f111
cc @dongxiao92 @pentschev @bandokihiro for vis
Expanded the docstring and added an advisory about the possibility of deadlocks should the function encoded in `CUoccupancyB2DSize` require the GIL. Added argument type validation for the `dynamic_shared_memory_needed` argument.
Performed additional manual testing with Cython-generated C-API functions, produced using the steps below.

Create the Cython source file:

```cython
# filename: cyx_b2ds.pyx
cdef inline int align_up(int num, int den) nogil:
    return ((num + den - 1) // den) * den

cdef inline size_t smem_needed(int block_size, size_t smem_bytes_per_warp) nogil:
    cdef int warp_size = 32
    cdef int bs = block_size * (block_size > 0)
    return (<size_t>align_up(bs, warp_size)) * smem_bytes_per_warp

cdef api size_t smem_needed_64(int block_size) nogil:
    return smem_needed(block_size, 64)

cdef api size_t smem_needed_96(int block_size) nogil:
    return smem_needed(block_size, 96)

cdef api size_t smem_needed_128(int block_size) nogil:
    return smem_needed(block_size, 128)

cdef api size_t smem_needed_196(int block_size) nogil:
    return smem_needed(block_size, 196)

cdef api size_t smem_needed_256(int block_size) nogil:
    return smem_needed(block_size, 256)

cdef api size_t smem_needed_384(int block_size) nogil:
    return smem_needed(block_size, 384)

cdef api size_t smem_needed_512(int block_size) nogil:
    return smem_needed(block_size, 512)

cdef api size_t smem_needed_gil(int block_size):
    return smem_needed(block_size, 32)
```

Compile and build:

```shell
cython -3 cyx_b2ds.pyx
cc cyx_b2ds.c -shared -fPIC $(python3-config --cflags) $(python3-config --ldflags) -o cyx_b2ds$(python3-config --extension-suffix)
```

Run the test:

```python
import ctypes

import cuda.core.experimental as cc
import cyx_b2ds as ext
from cuda.core.experimental._utils.cuda_utils import driver

cc.Device(0).set_current()
o1 = cc.Program(
    "__global__ void bar(double *p, int n, double x) { *p = n * x; }",
    code_type="c++",
).compile("cubin", name_expressions=("bar",))
k1 = o1.get_kernel("bar")

gp_fn = ctypes.pythonapi.PyCapsule_GetPointer
gp_fn.restype, gp_fn.argtypes = ctypes.c_void_p, [ctypes.py_object, ctypes.c_char_p]

def get_capi_fn_ptr(name):
    caps = ext.__pyx_capi__[name]
    capi_ptr = gp_fn(caps, b"size_t (int)")
    return driver.CUoccupancyB2DSize(_ptr=capi_ptr)
```
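The `PyCapsule_GetPointer` round-trip can be exercised without building the Cython module. The sketch below wraps an arbitrary ctypes callback pointer in a capsule with the same `"size_t (int)"` name that Cython gives `__pyx_capi__` entries, then extracts it back:

```python
import ctypes

# Configure PyCapsule_New / PyCapsule_GetPointer from the C API.
new_fn = ctypes.pythonapi.PyCapsule_New
new_fn.restype = ctypes.py_object
new_fn.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

get_fn = ctypes.pythonapi.PyCapsule_GetPointer
get_fn.restype = ctypes.c_void_p
get_fn.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Stand-in for a Cython api function: a size_t (*)(int) ctypes callback.
B2DSIZE = ctypes.CFUNCTYPE(ctypes.c_size_t, ctypes.c_int)
cb = B2DSIZE(lambda block_size: block_size * 64)
ptr = ctypes.cast(cb, ctypes.c_void_p).value

# Wrap it in a capsule (no destructor) and round-trip the pointer back out.
capsule = new_fn(ptr, b"size_t (int)", None)
extracted = get_fn(capsule, b"size_t (int)")
print(extracted == ptr)
```

The capsule name passed to `PyCapsule_GetPointer` must match the one used at creation, which is why the `b"size_t (int)"` signature string appears on both sides.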
To expand on this and capture the offline discussion: the concern here is that we have two global locks in play, one from the Python Global Interpreter Lock (GIL) and one from the CUDA driver. We risk running into the following situation:
This would lead to a deadlock, and we've seen this behavior in the past, e.g. numba/numba#4581.
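The hazard can be illustrated in miniature with two plain locks, here standing in for the GIL and the driver's internal lock (an illustrative sketch, not code from the PR): if one thread takes A then B while another takes B then A, both can block forever. The sketch below shows the safe discipline, a single consistent acquisition order, which is precisely what a GIL-requiring `CUoccupancyB2DSize` callback cannot guarantee once the driver calls back while holding its own lock:

```python
import threading

# lock_a stands in for the GIL, lock_b for the CUDA driver's internal lock.
lock_a, lock_b = threading.Lock(), threading.Lock()
results = []

def worker(name):
    # Both threads acquire in the same order, so no deadlock is possible.
    # A deadlock needs the opposite order in one thread while the other holds
    # the first lock -- the scenario described above.
    with lock_a:
        with lock_b:
            results.append(name)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```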
leofang
left a comment
Thanks, Sasha! LGTM overall, most comments below are doc-related.
For example, we need to add _launch_config.LaunchConfig, _module.KernelOccupancy, etc, to cuda_core/docs/source/api_private.rst to get them rendered and cross-ref'd.
Occupancy tests need not contain "saxpy" in the test name, even though they use the saxpy kernel for testing.
/ok to test 496eb5b |
leofang
left a comment
LGTM, thanks Sasha! I made a doc-only fix. The CI was green so let me admin-merge to save some resources.
Description
closes #504