-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Additional error checking for torch.cuda.nccl APIs.
#43247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Additional error checking for torch.cuda.nccl APIs.
#43247
Conversation
`torch.cuda.nccl` APIs didn't throw appropriate errors when called with inputs/outputs that were of the wrong type and it resulted in some cryptic errors instead. Adding some error checks with explicit error messages for these APIs. Differential Revision: [D23206069](https://our.internmc.facebook.com/intern/diff/D23206069/) [ghstack-poisoned]
`torch.cuda.nccl` APIs didn't throw appropriate errors when called with inputs/outputs that were of the wrong type and it resulted in some cryptic errors instead. Adding some error checks with explicit error messages for these APIs. Differential Revision: [D23206069](https://our.internmc.facebook.com/intern/diff/D23206069/) ghstack-source-id: 110214324 Pull Request resolved: #43247
💊 CI failures summary and remediationsAs of commit e2f1eff (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group. This comment has been revised 10 times. |
torch/cuda/nccl.py
Outdated
| def _check_sequence_type(inputs): | ||
| if not isinstance(inputs, list) and not isinstance(inputs, tuple): | ||
| raise TypeError("inputs should be a list/tuple") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use 2 line spacing here?
torch/cuda/nccl.py
Outdated
|
|
||
| def _check_sequence_type(inputs): | ||
| if not isinstance(inputs, list) and not isinstance(inputs, tuple): | ||
| raise TypeError("inputs should be a list/tuple") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it also be any python iterable/list-like object? I just tried the NCCL test with inputs = iter(inputs) (which makes it a list_iterable type) and inputs = set(inputs), and the current test_all_gather seems to pass. With this PR, would we stop supporting those types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually tried this approach out first, but apparently a standalone tensor is Iterable as well:
>>> isinstance(torch.rand(10), collections.Iterable)
True
As a result, if we check for just Iterable we might include a single Tensor as a valid input. I'll actually update the check with a combination of both checks to incorporate this.
`torch.cuda.nccl` APIs didn't throw appropriate errors when called with inputs/outputs that were of the wrong type and it resulted in some cryptic errors instead. Adding some error checks with explicit error messages for these APIs. Differential Revision: [D23206069](https://our.internmc.facebook.com/intern/diff/D23206069/) [ghstack-poisoned]
Pull Request resolved: #43247 `torch.cuda.nccl` APIs didn't throw appropriate errors when called with inputs/outputs that were of the wrong type and it resulted in some cryptic errors instead. Adding some error checks with explicit error messages for these APIs. ghstack-source-id: 110484323 Differential Revision: [D23206069](https://our.internmc.facebook.com/intern/diff/D23206069/)
rohan-varma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
`torch.cuda.nccl` APIs didn't throw appropriate errors when called with inputs/outputs that were of the wrong type and it resulted in some cryptic errors instead. Adding some error checks with explicit error messages for these APIs. Differential Revision: [D23206069](https://our.internmc.facebook.com/intern/diff/D23206069/) [ghstack-poisoned]
Pull Request resolved: #43247 `torch.cuda.nccl` APIs didn't throw appropriate errors when called with inputs/outputs that were of the wrong type and it resulted in some cryptic errors instead. Adding some error checks with explicit error messages for these APIs. ghstack-source-id: 110683546 Differential Revision: [D23206069](https://our.internmc.facebook.com/intern/diff/D23206069/)
|
This pull request has been merged in 306eb3d. |
Stack from ghstack:
torch.cuda.ncclAPIs. #43247 Additional error checking fortorch.cuda.ncclAPIs.torch.cuda.ncclAPIs didn't throw appropriate errors when calledwith inputs/outputs that were of the wrong type and it resulted in some cryptic
errors instead.
Adding some error checks with explicit error messages for these APIs.
Differential Revision: D23206069