
Conversation

@malfet commented Oct 6, 2020

Stack from ghstack:

Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`

Fixes #42517

Here is the NCCL dependency graph:

```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```

When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before #42514, every NCCL call made from `ProcessGroupNCCL.cpp` was also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those symbols into `libtorch_python.so`, which also resulted in linking other dependent symbols into that library.

This PR adds the `nccl[Send|Recv]` calls to `torch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids linking the static library twice.
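
For context, below is a minimal sketch of what an all-to-all built on grouped `ncclSend`/`ncclRecv` calls can look like (illustrative only; the actual `torch::cuda::nccl::all2all` operates on `at::Tensor`s):

```cpp
// Minimal sketch of an all-to-all exchange over ncclSend/ncclRecv.
// Assumes NCCL >= 2.7 (which introduced the point-to-point primitives)
// and contiguous device buffers split evenly across ranks; the name and
// signature are illustrative, not the actual torch_cuda code.
#include <cuda_runtime.h>
#include <nccl.h>

ncclResult_t all2all_sketch(
    void* sendbuff,
    void* recvbuff,
    size_t count_per_rank, // elements exchanged with each peer
    size_t elem_size,      // size in bytes of one element of dtype
    ncclDataType_t dtype,
    int nranks,
    ncclComm_t comm,
    cudaStream_t stream) {
  // Group the point-to-point calls so NCCL can schedule them together;
  // ungrouped blocking send/recv pairs could deadlock across ranks.
  ncclGroupStart();
  for (int peer = 0; peer < nranks; ++peer) {
    char* src = static_cast<char*>(sendbuff) + peer * count_per_rank * elem_size;
    char* dst = static_cast<char*>(recvbuff) + peer * count_per_rank * elem_size;
    ncclSend(src, count_per_rank, dtype, peer, comm, stream);
    ncclRecv(dst, count_per_rank, dtype, peer, comm, stream);
  }
  return ncclGroupEnd();
}
```

Because this function is compiled into `libtorch_cuda.so`, the `ncclSend`/`ncclRecv` references are resolved there once, and `libtorch_python.so` no longer needs its own copy of the archive members.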

A more involved, but more robust, solution would be to use wrappers exported in the `torch::cuda::nccl` namespace everywhere, instead of making direct NCCL API calls.
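
For illustration, such wrappers might be shaped like this (hypothetical declarations, not the actual `torch/csrc/cuda/nccl.h`):

```cpp
// Hypothetical shape of torch_cuda-exported NCCL wrappers: thin functions
// in the torch::cuda::nccl namespace that forward to the NCCL API, so that
// callers such as ProcessGroupNCCL.cpp reference only libtorch_cuda.so and
// never pull symbols out of libnccl.a themselves.
#include <cstddef>

namespace torch {
namespace cuda {
namespace nccl {

// Opaque handles standing in for ncclComm_t / cudaStream_t, so callers do
// not need the NCCL headers at all. (Illustrative, not the real types.)
using comm_t = void*;
using stream_t = void*;

void send(const void* buf, std::size_t count, int peer, comm_t comm, stream_t stream);
void recv(void* buf, std::size_t count, int peer, comm_t comm, stream_t stream);
void all2all(void* sendbuff, void* recvbuff, std::size_t count_per_rank, comm_t comm, stream_t stream);

} // namespace nccl
} // namespace cuda
} // namespace torch
```

With this layout, only the wrapper implementations in `torch_cuda` touch the NCCL API directly, so there is a single answer to the question of which binary owns the `libnccl.a` symbols.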

Differential Revision: D24138011

Use it from ProcessGroupNCCL

`torch/csrc/cuda/nccl.cpp` is compiled as part of the `torch_cuda` library, so calling this function from `ProcessGroupNCCL.cpp` avoids linking a second instance of `libnccl.a` into `torch_python`.

Fixes #42517

[ghstack-poisoned]
@malfet mentioned this pull request Oct 6, 2020
malfet added a commit that referenced this pull request Oct 6, 2020
Use it from ProcessGroupNCCL

ghstack-source-id: 1c5f90f
Pull Request resolved: #45900
@facebook-github-bot added the oncall: distributed label Oct 6, 2020
@malfet requested a review from ngimel October 6, 2020 15:03
@facebook-github-bot commented

@malfet merged this pull request in c19b9cd.

@facebook-github-bot deleted the gh/malfet/2/head branch October 11, 2020 14:18
mingzhe09088 pushed a commit to mingzhe09088/pytorch-1 that referenced this pull request Dec 4, 2020
Summary:
Pull Request resolved: pytorch#45900

Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`

Fixes pytorch#42517

Here is the NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```
When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before pytorch#42514, every NCCL call made from `ProcessGroupNCCL.cpp` was also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those symbols into `libtorch_python.so`, which also resulted in linking other dependent symbols into that library.

This PR adds the `nccl[Send|Recv]` calls to `torch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids linking the static library twice.

A more involved, but more robust, solution would be to use wrappers exported in the `torch::cuda::nccl` namespace everywhere, instead of making direct NCCL API calls.

Test Plan: Imported from OSS

Reviewed By: mingzhe09088

Differential Revision: D24138011

Pulled By: malfet

fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
