Add NCCL Alltoall to PT NCCL process group #42514
Conversation
This pull request was exported from Phabricator. Differential Revision: D22917967
💊 CI failures summary and remediations: as of commit c6a84b7 (more details on the Dr. CI page), ci.pytorch.org reports 1 failed job. This comment was automatically generated by Dr. CI and has been revised 7 times.
mrshenli left a comment:
Test errors are irrelevant. Thanks for adding this!
Aug 04 06:24:51 ERROR test/onnx/test_pytorch_onnx_onnxruntime.py - NameError: name 'enableScr...
Aug 04 06:24:51 ERROR test/onnx/test_pytorch_onnx_onnxruntime.py - NameError: name 'enableScr...
Aug 04 06:24:51 ERROR test/onnx/test_pytorch_onnx_onnxruntime.py - NameError: name 'enableScr...
Summary: Pull Request resolved: pytorch#42514

Add Alltoall and Alltoallv to PT NCCL process group using NCCL Send/Recv.

Reviewed By: mrshenli
Differential Revision: D22917967
fbshipit-source-id: e3345ddbb5217dc5fabcd5c02db4cf67becff4b5
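Since NCCL (as of 2.7.x) exposes point-to-point `ncclSend`/`ncclRecv` rather than a built-in all-to-all primitive, the collective is composed from a group of sends and receives issued between `ncclGroupStart` and `ncclGroupEnd`. Below is a minimal sketch of the Alltoallv pattern on raw device buffers; the function name, parameters, and float dtype are illustrative, not the code this PR adds to `ProcessGroupNCCL.cpp`:

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

// Sketch of all-to-all-v composed from ncclSend/ncclRecv (illustrative names).
// sendcounts[i]/recvcounts[i]: number of floats exchanged with rank i.
// sdispls[i]/rdispls[i]: element offsets of those chunks in sendbuf/recvbuf.
ncclResult_t all_to_all_v(const float* sendbuf, const size_t* sendcounts,
                          const size_t* sdispls, float* recvbuf,
                          const size_t* recvcounts, const size_t* rdispls,
                          int nranks, ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();  // fuse all point-to-point ops into one group
  for (int peer = 0; peer < nranks; ++peer) {
    ncclSend(sendbuf + sdispls[peer], sendcounts[peer], ncclFloat, peer, comm, stream);
    ncclRecv(recvbuf + rdispls[peer], recvcounts[peer], ncclFloat, peer, comm, stream);
  }
  return ncclGroupEnd();  // ops are enqueued on `stream` asynchronously
}
```

The plain Alltoall is the equal-split special case: every count is `numel / nranks` and the displacements are the corresponding multiples.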
This pull request was exported from Phabricator. Differential Revision: D22917967
Force-pushed from af2fa21 to c6a84b7.
This pull request has been merged in ecb88c5.
Summary: Pull Request resolved: #45900

Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`. Fixes #42517.

Here is the NCCL dependency graph:

```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```

When a static library is linked into a dynamic library or an executable, the linker removes all unused and duplicate symbols from that library unless the `-whole-archive` option is used. Before #42514, every NCCL call made from `ProcessGroupNCCL.cpp` was also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`. But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those symbols into `libtorch_python.so`, which in turn pulled other dependent symbols into that library. This PR keeps the `nccl[Send|Recv]` calls in `libtorch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids double-linking the static library. A more involved, but less error-prone, solution would be to use the wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.

Test Plan: Imported from OSS
Reviewed By: mingzhe09088
Differential Revision: [D24138011](https://our.internmc.facebook.com/intern/diff/D24138011)
Pulled By: malfet
fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
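In other words, the only translation unit that references `ncclSend`/`ncclRecv` is the one compiled into `libtorch_cuda.so`, and `ProcessGroupNCCL.cpp` only calls the exported wrapper. A compressed sketch of the call-site side of that split; the `all2all` signature below is illustrative, not the actual declaration in `torch/csrc/cuda/nccl.h`, and the snippet needs `libtorch_cuda` to link:

```cpp
#include <ATen/ATen.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Declaration of the wrapper exported by libtorch_cuda.so (illustrative
// signature).  Its definition -- the only code that touches ncclSend/ncclRecv --
// is compiled into libtorch_cuda.so, so those NCCL symbols are resolved there.
namespace torch { namespace cuda { namespace nccl {
void all2all(at::Tensor& input, at::Tensor& output, int nranks,
             ncclComm_t comm, cudaStream_t stream);
}}} // namespace torch::cuda::nccl

// Illustrative stand-in for the ProcessGroupNCCL.cpp call site: it references
// only the wrapper above, so linking this object into libc10d.a /
// libtorch_python.so no longer drags the NCCL point-to-point symbols with it.
void launch_alltoall_single(at::Tensor& input, at::Tensor& output, int nranks,
                            ncclComm_t comm, cudaStream_t stream) {
  torch::cuda::nccl::all2all(input, output, nranks, comm, stream);
}
```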
Summary: NCCL `alltoall_single` was already added in #42514; this PR adds NCCL `alltoall`. The difference is that `alltoall_single` works on a single tensor and sends/receives slices of that tensor, while `alltoall` works on a list of tensors and sends/receives the tensors in that list. cc: ptrblck ngimel

Pull Request resolved: #44374
Reviewed By: zhangguanheng66, mrshenli
Differential Revision: D24455427
Pulled By: srinivas212
fbshipit-source-id: 42fdebdd14f8340098e2c34ef645bd40603552b1
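To make the distinction concrete, here is a minimal sketch of the list-based variant, where each peer exchanges a whole tensor rather than a slice of one contiguous buffer. The function name and signature are illustrative, not the actual `ProcessGroupNCCL` implementation, and an already-initialized communicator and stream are assumed:

```cpp
#include <vector>
#include <ATen/ATen.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Illustrative sketch of the list-based all-to-all: inputs[i] goes to rank i,
// outputs[i] is received from rank i; tensors in the list may differ in size.
void all_to_all_tensor_list(std::vector<at::Tensor>& outputs,
                            std::vector<at::Tensor>& inputs,
                            ncclComm_t comm, cudaStream_t stream) {
  const int nranks = static_cast<int>(inputs.size());
  ncclGroupStart();
  for (int peer = 0; peer < nranks; ++peer) {
    // Exchange raw bytes so the sketch stays dtype-agnostic.
    ncclSend(inputs[peer].data_ptr(),
             inputs[peer].numel() * inputs[peer].element_size(),
             ncclChar, peer, comm, stream);
    ncclRecv(outputs[peer].data_ptr(),
             outputs[peer].numel() * outputs[peer].element_size(),
             ncclChar, peer, comm, stream);
  }
  ncclGroupEnd();
}
```

An `alltoall_single`-style implementation would instead compute the per-peer pointers as offsets into a single input/output tensor, as in the Alltoallv sketch earlier on this page.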