Conversation
Outstanding issues:
The `Stream` class does not have a `_handle` data member.
This is necessary to avoid a circular dependency: cluster-related occupancy functions need `LaunchConfig`, occupancy functions are defined in `_module.py`, and `_launcher.py`, which used to house the definition of `LaunchConfig`, imports `Kernel` from `_module.py`.
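One common way to break such an import cycle, sketched below with hypothetical stand-in modules `mod_a`/`mod_b` (not the repository's actual solution, which moved the occupancy code into `_module.py`), is to defer one side of the cycle to call time:

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# mod_b needs a class from mod_a at import time, while mod_a only needs
# mod_b inside a function body -- so mod_a defers its import to call time.
tmp = Path(tempfile.mkdtemp())
(tmp / "mod_a.py").write_text(textwrap.dedent("""
    class A:  # analogous to LaunchConfig living in one module
        pass

    def use_b():
        from mod_b import B  # deferred import breaks the cycle
        return B()
"""))
(tmp / "mod_b.py").write_text(textwrap.dedent("""
    from mod_a import A  # safe: mod_a has no top-level import of mod_b

    class B(A):  # analogous to Kernel depending on LaunchConfig
        pass
"""))
sys.path.insert(0, str(tmp))

import mod_a

print(type(mod_a.use_b()).__name__)
```

The deferred import resolves only once `mod_a` is fully initialized, so neither module sees a half-imported partner.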
This class defines kernel occupancy query methods:

- `max_active_blocks_per_multiprocessor`
- `max_potential_block_size`
- `available_dynamic_shared_memory_per_block`
- `max_potential_cluster_size`
- `max_active_clusters`

The implementation is based on the driver API. The following occupancy-related driver functions are not used:

- `cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags`
- `cuOccupancyMaxPotentialBlockSizeWithFlags`

In `cuOccupancyMaxPotentialBlockSize`, only a constant dynamic shared-memory size is supported for now. Supporting a variable dynamic shared-memory size that depends on the block size is deferred until the design is resolved.
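For intuition, a query of the `max_active_blocks_per_multiprocessor` kind boils down to taking the tightest of several per-SM resource limits. The following is a toy sketch with made-up device limits; the real implementation defers entirely to the driver API:

```python
# Illustrative occupancy arithmetic. All limits are invented defaults,
# not real device values; real queries go through the CUDA driver.
def max_active_blocks_per_sm(
    threads_per_block,
    smem_per_block,
    max_threads_per_sm=2048,
    max_blocks_per_sm=32,
    smem_per_sm=49152,
):
    """Blocks per SM are bounded by thread, block-slot, and shared-memory limits."""
    if threads_per_block <= 0:
        return 0
    by_threads = max_threads_per_sm // threads_per_block
    by_smem = smem_per_sm // smem_per_block if smem_per_block > 0 else max_blocks_per_sm
    return min(by_threads, by_smem, max_blocks_per_sm)

print(max_active_blocks_per_sm(256, 8192))  # shared memory is the binding limit here
```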
/ok to test
Use it as the return type for the `KernelOccupancy.max_potential_block_size` output.
The `cuda_utils.driver.CUoccupancyB2DSize` type is supported. The required size of the dynamic shared memory allocation was renamed to `dynamic_shared_memory_needed`.
The test requires Numba. If Numba is absent, the test is skipped; otherwise `numba.cfunc` is used to compile the Python function. The `ctypes.CFuncPtr` object obtained from `cfunc_res.ctypes` is converted to `CUoccupancyB2DSize`.
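For illustration, the same kind of C-callable `size_t (*)(int)` pointer can be produced with a plain `ctypes` callback, no Numba needed. This is a sketch, not the PR's test code, and note that such a callback re-enters Python and therefore needs the GIL, which matters for the deadlock concern raised later in this thread:

```python
import ctypes

# A size_t (*)(int) callback, the shape CUoccupancyB2DSize expects.
B2DSIZE = ctypes.CFUNCTYPE(ctypes.c_size_t, ctypes.c_int)

@B2DSIZE
def dynamic_shared_memory_needed(block_size):
    # 1 KiB of dynamic shared memory per full warp of 32 threads
    return 0 if block_size <= 32 else ((block_size - 1) // 32) * 1024

# The raw address is what a CUoccupancyB2DSize(_ptr=...) conversion would use.
fn_ptr = ctypes.cast(dynamic_shared_memory_needed, ctypes.c_void_p).value
print(hex(fn_ptr))
```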
In case we do not want to depend on Numba, the callback can come from a small C library:

```c
// gcc -shared -fPIC b2dsize.c -o b2dsize.so
#include <stddef.h>

size_t dynamic_shared_memory_needed(int blockSize) {
    return (blockSize <= 32) ? (size_t)0
                             : (size_t)((blockSize - 1) / 32) * ((size_t)1024);
}
```

Then:

```python
import ctypes

from cuda.core.experimental._utils.cuda_utils import driver

lib = ctypes.cdll.LoadLibrary("./b2dsize.so")
cfunc = lib.dynamic_shared_memory_needed
fn_ptr = ctypes.cast(cfunc, ctypes.c_void_p).value
dynamic_smem_needed_fn = driver.CUoccupancyB2DSize(_ptr=fn_ptr)
```

This would require a compiler being available at test time, which is easy to arrange in conda, at least. We could build a fixture that builds such a library and skips the test if the build step fails due to an absent compiler.
/ok to test |
/ok to test 436f111
cc @dongxiao92 @pentschev @bandokihiro for vis
Expanded the docstring and added an advisory about the possibility of deadlocks should the function encoded in `CUoccupancyB2DSize` require the GIL. Added argument type validation for the `dynamic_shared_memory_needed` argument.
Performed additional manual testing with Cython-generated C-API functions, produced using the steps below.

Create the Cython source file:

```cython
# filename: cyx_b2ds.pyx
cdef inline int align_up(int num, int den) nogil:
    return ((num + den - 1) // den) * den

cdef inline size_t smem_needed(int block_size, size_t smem_bytes_per_warp) nogil:
    cdef int warp_size = 32
    cdef int bs = block_size * (block_size > 0)
    return (<size_t>align_up(bs, warp_size)) * smem_bytes_per_warp

cdef api size_t smem_needed_64(int block_size) nogil:
    return smem_needed(block_size, 64)

cdef api size_t smem_needed_96(int block_size) nogil:
    return smem_needed(block_size, 96)

cdef api size_t smem_needed_128(int block_size) nogil:
    return smem_needed(block_size, 128)

cdef api size_t smem_needed_196(int block_size) nogil:
    return smem_needed(block_size, 196)

cdef api size_t smem_needed_256(int block_size) nogil:
    return smem_needed(block_size, 256)

cdef api size_t smem_needed_384(int block_size) nogil:
    return smem_needed(block_size, 384)

cdef api size_t smem_needed_512(int block_size) nogil:
    return smem_needed(block_size, 512)

cdef api size_t smem_needed_gil(int block_size):
    return smem_needed(block_size, 32)
```

Compile and build:

```shell
cython -3 cyx_b2ds.pyx
cc cyx_b2ds.c -shared -fPIC $(python3-config --cflags) $(python3-config --ldflags) -o cyx_b2ds$(python3-config --extension-suffix)
```

Run the test:

```python
import ctypes

import cuda.core.experimental as cc
import cyx_b2ds as ext
from cuda.core.experimental._utils.cuda_utils import driver

cc.Device(0).set_current()
o1 = cc.Program(
    "__global__ void bar(double *p, int n, double x) { *p = n * x; }",
    code_type="c++",
).compile("cubin", name_expressions=("bar",))
k1 = o1.get_kernel("bar")

gp_fn = ctypes.pythonapi.PyCapsule_GetPointer
gp_fn.restype, gp_fn.argtypes = ctypes.c_void_p, [ctypes.py_object, ctypes.c_char_p]

def get_capi_fn_ptr(name):
    caps = ext.__pyx_capi__[name]
    capi_ptr = gp_fn(caps, b"size_t (int)")
    return driver.CUoccupancyB2DSize(_ptr=capi_ptr)
```
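The `PyCapsule_GetPointer` round-trip can be exercised without building the Cython module. The sketch below wraps an arbitrary ctypes callback pointer in a capsule with the same `"size_t (int)"` name that Cython gives `__pyx_capi__` entries, then extracts it back:

```python
import ctypes

# Configure PyCapsule_New / PyCapsule_GetPointer from the C API.
new_fn = ctypes.pythonapi.PyCapsule_New
new_fn.restype = ctypes.py_object
new_fn.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

get_fn = ctypes.pythonapi.PyCapsule_GetPointer
get_fn.restype = ctypes.c_void_p
get_fn.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Stand-in for a Cython api function: a size_t (*)(int) ctypes callback.
B2DSIZE = ctypes.CFUNCTYPE(ctypes.c_size_t, ctypes.c_int)
cb = B2DSIZE(lambda block_size: block_size * 64)
ptr = ctypes.cast(cb, ctypes.c_void_p).value

# Wrap it in a capsule (no destructor) and round-trip the pointer back out.
capsule = new_fn(ptr, b"size_t (int)", None)
extracted = get_fn(capsule, b"size_t (int)")
print(extracted == ptr)
```

The capsule name passed to `PyCapsule_GetPointer` must match the one used at creation, which is why the `b"size_t (int)"` signature string appears on both sides.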
To expand on this and capture the offline discussion: the concern here is that we have two global locks in play, one from the Python Global Interpreter Lock (GIL) and one from the CUDA driver. We risk running into the following situation:
This would lead to a deadlock, and we've seen this behavior in the past, e.g. numba/numba#4581.
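The hazard can be illustrated in miniature with two plain locks, here standing in for the GIL and the driver's internal lock (an illustrative sketch, not code from the PR): if one thread takes A then B while another takes B then A, both can block forever. The sketch below shows the safe discipline, a single consistent acquisition order, which is precisely what a GIL-requiring `CUoccupancyB2DSize` callback cannot guarantee once the driver calls back while holding its own lock:

```python
import threading

# lock_a stands in for the GIL, lock_b for the CUDA driver's internal lock.
lock_a, lock_b = threading.Lock(), threading.Lock()
results = []

def worker(name):
    # Both threads acquire in the same order, so no deadlock is possible.
    # A deadlock needs the opposite order in one thread while the other holds
    # the first lock -- the scenario described above.
    with lock_a:
        with lock_b:
            results.append(name)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```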
leofang
left a comment
Thanks, Sasha! LGTM overall, most comments below are doc-related.
For example, we need to add _launch_config.LaunchConfig, _module.KernelOccupancy, etc, to cuda_core/docs/source/api_private.rst to get them rendered and cross-ref'd.
Occupancy tests need not contain "saxpy" in the test name, even though they use the saxpy kernel for testing.
/ok to test 496eb5b |
leofang
left a comment
LGTM, thanks Sasha! I made a doc-only fix. The CI was green so let me admin-merge to save some resources.
Description
closes #504