Cythonize away some perf hot spots #709

Conversation
```python
self._mnff = Event._MembersNeededForFinalize(self, None)
options = check_or_create_options(EventOptions, options, "Event options")
def _init(cls, device_id: int, ctx_handle: Context, options=None):
```
Since `Event` contains native class members, perhaps adding `__cinit__` to initialize them is appropriate. Something like:

```python
def __cinit__(self):
    self._timing_disabled = False
    self._busy_waited = False
    self._device_id = -1
```

I also think it would be safe to set object class members to None. This would ensure that `Event.__new__(Event)` would return an initialized struct.
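To illustrate why `__cinit__` matters here (a hypothetical plain-Python sketch, not the actual cuda.core `Event` class): `SomeClass.__new__(SomeClass)` bypasses `__init__`, so attributes set only in `__init__` are missing on such an instance, whereas Cython guarantees `__cinit__` runs for every instance, however it is created.

```python
# Illustrative sketch: __new__ skips __init__, so __init__-only
# attributes are absent. Cython's __cinit__ closes this gap because
# it always runs, even for Event.__new__(Event).
class Event:
    def __init__(self):
        self._device_id = 0  # only set when __init__ actually runs

e = Event.__new__(Event)          # __init__ is skipped here
print(hasattr(e, "_device_id"))   # False: attribute was never set
```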
I think Cython sets everything to None for us, but it'd be good to verify this indeed:

> Cython additionally takes responsibility of setting all object attributes to None,
OK, let's leave object members out. Should I push adding `Event.__cinit__`?
I think the same section says all members are zero/null initialized?
Yes, but is it appropriate to zero-initialize `_device_id`? Perhaps it does not matter much.
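One way to see why a `-1` sentinel may be preferable to zero-initialization (a hypothetical plain-Python sketch, not the actual cuda.core code; the `describe` helper and `UNINITIALIZED` constant are illustrative names only): device IDs start at 0, so a zero-initialized `_device_id` is indistinguishable from "device 0", while `-1` makes the uninitialized state detectable.

```python
# Illustrative sketch: -1 as an "uninitialized" sentinel for a device ID.
UNINITIALIZED = -1

def describe(device_id: int) -> str:
    """Distinguish an uninitialized device ID from a real one."""
    if device_id == UNINITIALIZED:
        return "uninitialized"
    return f"device {device_id}"

print(describe(-1))  # uninitialized
print(describe(0))   # device 0 -- a valid device, not a sentinel
```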
CI is green
Description
Less aggressive version of #677.
Based on the summary in #658 (comment), this PR offers performance optimizations over identified hotspots to bring us a lot closer to our reference (CuPy). The optimization strategy is to call the `cuda.bindings` Python APIs in the Cython code, so as to avoid introducing CTK as a build-time dependency (and therefore having to ship two separate packages, `cuda-core-cu11` and `cuda-core-cu12`). In other words, this PR tries to find a reasonable balance between performance, ease of development, and ease of deployment, without introducing any breaking change.
Preliminary data:
Checklist