[XPU]fix cuda event used in XPU model runner #23708
Conversation
Code Review
This pull request aims to fix a compatibility issue with torch.cuda.Event on XPU devices by introducing a _torch_cuda_wrapper context manager to monkey-patch torch.cuda.Event during the initialization of XPUModelRunner.
While the approach of using a context manager to temporarily patch the torch.cuda.Event is sound, the current implementation has a critical flaw. It permanently replaces torch.cuda.Event with a placeholder, which can cause side effects across the application. My review includes a critical comment with a suggested fix to properly restore the original torch.cuda.Event after use, ensuring the change is safe and contained.
```python
    class _EventPlaceholder:

        def __init__(self, *args, **kwargs) -> None:
            self.record = lambda: None
            self.synchronize = lambda: None

    try:
        # replace cuda Event with xpu Event, this should work by default
        torch.cuda.Event = torch.xpu.Event
        yield
    finally:
        # if anything goes wrong, just patch it with a placeholder
        torch.cuda.Event = _EventPlaceholder
```
The current implementation of _torch_cuda_wrapper permanently alters the global torch.cuda.Event by setting it to _EventPlaceholder in the finally block. This is a dangerous side effect that can break functionality in other parts of the codebase that rely on the original torch.cuda.Event, especially in mixed-device environments or if other operations use torch.cuda.Event after this runner is initialized.
The finally block should restore the original state of torch.cuda.Event instead of replacing it with a placeholder. This ensures that the monkey-patching is contained within the with statement and does not leak. The _EventPlaceholder class is also no longer needed with this change.
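To make the consequence concrete, here is a minimal sketch (illustrative only, assuming the `_torch_cuda_wrapper` shown above is in scope) of what happens after the `with` block exits: `torch.cuda.Event` is left pointing at `_EventPlaceholder`, so later event operations silently become no-ops. The suggested change below avoids this by restoring the original attribute on exit.

```python
import torch

# Assumes the _torch_cuda_wrapper context manager from this PR is importable.
with _torch_cuda_wrapper():
    pass  # XPUModelRunner initialization would run here

# After exit, torch.cuda.Event is _EventPlaceholder, not a real event class.
event = torch.cuda.Event(enable_timing=True)
event.record()       # silently does nothing
event.synchronize()  # silently does nothing, no real synchronization occurs
```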
```diff
-    class _EventPlaceholder:
-        def __init__(self, *args, **kwargs) -> None:
-            self.record = lambda: None
-            self.synchronize = lambda: None
-    try:
-        # replace cuda Event with xpu Event, this should work by default
-        torch.cuda.Event = torch.xpu.Event
-        yield
-    finally:
-        # if anything goes wrong, just patch it with a placeholder
-        torch.cuda.Event = _EventPlaceholder
+    original_event = getattr(torch.cuda, "Event", None)
+    try:
+        # replace cuda Event with xpu Event, this should work by default
+        torch.cuda.Event = torch.xpu.Event
+        yield
+    finally:
+        # Restore the original torch.cuda.Event.
+        if original_event is not None:
+            torch.cuda.Event = original_event
+        elif hasattr(torch.cuda, "Event"):
+            # If it didn't exist before, remove our monkey-patched attribute.
+            delattr(torch.cuda, "Event")
```
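With the suggested restore in place, the monkey-patch stays contained within the `with` statement. A minimal usage sketch (hypothetical call site, not part of this PR's diff; assumes an XPU-enabled PyTorch build):

```python
import torch

original_event_cls = torch.cuda.Event

# Assumes the _torch_cuda_wrapper context manager with the restore logic above.
with _torch_cuda_wrapper():
    # Inside the block, torch.cuda.Event is torch.xpu.Event, so code written
    # against the CUDA API (e.g. the GPUModelRunner init path) creates XPU events.
    event = torch.cuda.Event()
    event.record()
    event.synchronize()

# Outside the block, the original class is back and nothing has leaked.
assert torch.cuda.Event is original_event_cls
```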
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Purpose
#22760 introduced torch.cuda.Event in the GPUModelRunner init method, which causes a compatibility issue on XPU. This PR fixes it by adding an API wrapper.
Test Plan
CI.
Test Result