Cythonize away some perf hot spots #709

Conversation
```python
self._mnff = Event._MembersNeededForFinalize(self, None)
options = check_or_create_options(EventOptions, options, "Event options")
def _init(cls, device_id: int, ctx_handle: Context, options=None):
```
Since `Event` contains native class members, perhaps adding `__cinit__` to initialize them is appropriate. Something like:

```python
def __cinit__(self):
    self._timing_disabled = False
    self._busy_waited = False
    self._device_id = -1
```

I also think it would be safe to set object class members to None. This would ensure that `Event.__new__(Event)` would return an initialized struct.
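To illustrate why `__cinit__` matters here (a hypothetical plain-Python sketch, not the actual cuda.core `Event` class): `SomeClass.__new__(SomeClass)` bypasses `__init__`, so attributes set only in `__init__` are missing on such an instance, whereas Cython guarantees `__cinit__` runs for every instance, however it is created.

```python
# Illustrative sketch: __new__ skips __init__, so __init__-only
# attributes are absent. Cython's __cinit__ closes this gap because
# it always runs, even for Event.__new__(Event).
class Event:
    def __init__(self):
        self._device_id = 0  # only set when __init__ actually runs

e = Event.__new__(Event)          # __init__ is skipped here
print(hasattr(e, "_device_id"))   # False: attribute was never set
```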
I think Cython sets everything to None for us, but it'd be good to verify this indeed:

> Cython additionally takes responsibility of setting all object attributes to None,
OK, let's leave object members out. Should I push adding `Event.__cinit__`?
I think the same section says all members are zero/null initialized?
Yes, but is it appropriate to zero-initialize `_device_id`? Perhaps it does not matter much.
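One way to see why a `-1` sentinel may be preferable to zero-initialization (a hypothetical plain-Python sketch, not the actual cuda.core code; the `describe` helper and `UNINITIALIZED` constant are illustrative names only): device IDs start at 0, so a zero-initialized `_device_id` is indistinguishable from "device 0", while `-1` makes the uninitialized state detectable.

```python
# Illustrative sketch: -1 as an "uninitialized" sentinel for a device ID.
UNINITIALIZED = -1

def describe(device_id: int) -> str:
    """Distinguish an uninitialized device ID from a real one."""
    if device_id == UNINITIALIZED:
        return "uninitialized"
    return f"device {device_id}"

print(describe(-1))  # uninitialized
print(describe(0))   # device 0 -- a valid device, not a sentinel
```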
CI is green
Description
Less aggressive version of #677.
Based on the summary in #658 (comment), this PR offers performance optimizations over identified hotspots to bring us a lot closer to our reference (CuPy). The optimization strategy is to call the `cuda.bindings` Python APIs in the Cython code, so as to avoid introducing CTK as a build-time dependency (and therefore having to ship two separate packages, `cuda-core-cu11` and `cuda-core-cu12`). In other words, this PR tries to find a reasonable balance between performance, ease of development, and ease of deployment, without introducing any breaking change.
Preliminary data:
Checklist