[Resubmit #41318] NCCL backend support for torch bool #41959
Conversation
Closes #24137. Since bool is not supported as a native ncclDataType_t, we add some upcasting + downcasting logic to support it.
Differential Revision: [D22496604](https://our.internmc.facebook.com/intern/diff/D22496604/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D22496604/)!
Closes #24137. This PR adds support for the `torch.bool` tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since `bool` is not supported as a native `ncclDataType_t`, we add the following logic:
1) Detect whether the input tensors are of bool type. If so, cast the inputs and outputs to int tensors.
2) Run the specified reduction.
3) If we had to cast, cast the outputs back to boolean tensors. If the collective does not operate in-place, also re-cast the inputs back to bool so that they are not modified as a result of the op.
The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and
Note that this PR doesn't add support for BAND/BOR/BXOR, because those reduction ops are not currently supported by the NCCL backend; see #41362. Tests are added to ensure that the reductions work as expected.
Differential Revision: [D22496604](https://our.internmc.facebook.com/intern/diff/D22496604/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D22496604/)!
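A minimal Python sketch of the cast-and-restore pattern described above, expressed at the user-API level. The actual change lives in ProcessGroupNCCL's C++ collective wrappers; `all_reduce_bool` is a hypothetical helper name used only for illustration:

```python
import torch
import torch.distributed as dist

def all_reduce_bool(tensor: torch.Tensor, op=dist.ReduceOp.SUM) -> torch.Tensor:
    """Hypothetical helper mirroring the upcast/reduce/downcast logic."""
    # 1) Detect bool inputs; NCCL has no native bool type, so upcast to int.
    was_bool = tensor.dtype == torch.bool
    work = tensor.to(torch.int) if was_bool else tensor
    # 2) Run the requested reduction on the integer view.
    dist.all_reduce(work, op=op)
    # 3) Cast the result back to bool. On {0, 1} values, SUM and MAX behave
    #    as logical OR; PROD and MIN behave as logical AND.
    if was_bool:
        tensor.copy_(work != 0)
    return tensor
```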
Closes #24137. This PR adds support for the `torch.bool` tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since `bool` is not supported as a native `ncclDataType_t`, we add the following logic:
1) Map `at::kBool` to `ncclUint8`.
2) During a reduction (allreduce, for example), if the operation is SUM, override it to MAX to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference, since both function as a bitwise OR.
The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and
Note that this PR doesn't add support for BAND/BOR/BXOR, because those reduction ops are not currently supported by the NCCL backend; see #41362. Tests are added to ensure that the reductions work as expected.
Differential Revision: [D22496604](https://our.internmc.facebook.com/intern/diff/D22496604/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D22496604/)!
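The dtype/op translation this final version describes, as a hedged Python pseudocode sketch. The real mapping lives in C++ inside ProcessGroupNCCL; the `_NCCL_DTYPE` table and `to_nccl_reduce_op` below are hypothetical names for illustration only:

```python
import torch
import torch.distributed as dist

# Hypothetical illustration of the dtype mapping; the real table is in C++.
_NCCL_DTYPE = {
    torch.uint8:   "ncclUint8",
    torch.int32:   "ncclInt32",
    torch.float32: "ncclFloat32",
    torch.bool:    "ncclUint8",  # bool rides on uint8 (1 byte per element)
}

def to_nccl_reduce_op(op, dtype):
    # SUM on a uint8-backed bool can overflow: e.g. 256 ranks each
    # contributing 1 wraps around to 0. Rewriting SUM to MAX avoids this
    # and is equivalent to logical OR on {0, 1} values.
    if dtype == torch.bool and op == dist.ReduceOp.SUM:
        return dist.ReduceOp.MAX
    return op
```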
facebook-github-bot
left a comment
@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@mrshenli, The tests for this diff should be fixed now and …
facebook-github-bot
left a comment
@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot
left a comment
@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
💊 CI failures summary and remediations
As of commit 73040f7 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm.
@rohan-varma merged this pull request in 366c014.
Summary:
Pull Request resolved: #64
Support torch.bool and convert it to UCC UINT8:
- enable collectives for tensors with torch.bool
- add `to_ucc_reduceOp` to handle the special cases for reduction operations (SUM->MAX, PROD->MIN; other operations should not be supported)
- add `to_ucc_dType` to check that the size of `torch.bool` is 1, since it is implementation-defined; abort the job if the size of `torch.bool` is not 1
Reference:
- support in NCCL PG: pytorch/pytorch#41959
Reviewed By: bryanmr
Differential Revision: D35301907
fbshipit-source-id: eba1264c91ff97a708289e66d763919505e9250f
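A hedged sketch of the two checks this UCC commit describes. The actual `to_ucc_reduceOp` and `to_ucc_dType` helpers are C++; the Python analogues below use hypothetical names purely to show the logic:

```python
import torch
import torch.distributed as dist

def to_ucc_reduce_op(op, dtype):
    """Hypothetical analogue of the C++ helper: remap bool reductions."""
    if dtype == torch.bool:
        if op == dist.ReduceOp.SUM:
            return dist.ReduceOp.MAX   # logical OR on {0, 1}
        if op == dist.ReduceOp.PRODUCT:
            return dist.ReduceOp.MIN   # logical AND on {0, 1}
        if op not in (dist.ReduceOp.MAX, dist.ReduceOp.MIN):
            raise ValueError(f"{op} is not supported for torch.bool")
    return op

def check_bool_size():
    # sizeof(bool) is implementation-defined in C++; mapping bool to UINT8
    # is only valid when a bool element occupies exactly 1 byte.
    assert torch.tensor(True).element_size() == 1, \
        "torch.bool element size must be 1 byte for the UINT8 mapping"
```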
Resubmit of #41318 pushed to ci-all branch.
Original description:
Closes #24137.
This PR adds support for the torch.bool tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since bool is not supported as a native ncclDataType_t, we add the following logic:
1) Map at::kBool to ncclUint8
2) During reduction (allreduce, for example), if the operation is SUM, override it to MAX to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference, since both function as a bitwise OR.
The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and
Note that this PR doesn't add support for BAND/BOR/BXOR, because those reduction ops are not currently supported by the NCCL backend; see #41362.
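To illustrate the user-visible behavior this PR enables, a small usage sketch. It assumes an initialized NCCL process group with each rank bound to its CUDA device; the per-rank tensor values in the comments are hypothetical:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", ...) has already run and each
# rank has called torch.cuda.set_device(...).
t = torch.tensor([True, False, False], device="cuda")    # rank 0
# t = torch.tensor([False, True, False], device="cuda")  # rank 1

# SUM (and MAX) act as logical OR on bool tensors:
dist.all_reduce(t, op=dist.ReduceOp.SUM)
# t is now [True, True, False] on every rank

# PRODUCT (and MIN) act as logical AND:
# dist.all_reduce(t, op=dist.ReduceOp.PRODUCT)
```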