
Conversation

@mingzhe09088
Contributor

Summary:
Pull Request resolved: #45900

Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`

Fixes #42517

Here is an NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```

When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library unless the `-whole-archive` option is used. Before #42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those symbols into `libtorch_python.so`, which also resulted in linking other dependent symbols into that library.
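
To make the archive behavior concrete, here is a tiny self-contained sketch; the file names, symbol names, and build commands are hypothetical, assuming a GNU toolchain:

```cpp
// Build sketch:
//   g++ -c used.cpp unused.cpp
//   ar rcs libdemo.a used.o unused.o
//   g++ main.cpp libdemo.a -o demo
//       -> only used.o is copied into the executable; unused.o is dropped
//   g++ main.cpp -Wl,--whole-archive libdemo.a -Wl,--no-whole-archive -o demo
//       -> every member of the archive is copied in, referenced or not

// ---- used.cpp ----
int used() { return 42; }

// ---- unused.cpp ----
int unused() { return 7; }  // never referenced from main.cpp, so normally dropped

// ---- main.cpp ----
int used();                  // resolved from libdemo.a at link time
int main() { return used(); }
```

The same mechanism applies to `libnccl.a`: once `ProcessGroupNCCL.cpp` referenced `ncclSend`/`ncclRecv`, those archive members (and the symbols they depend on) were copied into `libtorch_python.so` in addition to `libtorch_cuda.so`.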

This PR adds the `ncclSend`/`ncclRecv` calls to `libtorch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids double-linking the static library.

A more involved, but less error-prone, solution would be to use the wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.
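
For illustration, here is a minimal sketch of what an `all2all` implemented inside `torch_cuda` might look like; the signature and parameter names are assumptions, not the actual PR code, and error checking is elided:

```cpp
#include <cstddef>

#include <cuda_runtime.h>
#include <nccl.h>

namespace torch {
namespace cuda {
namespace nccl {

// Exchanges equal-sized chunks between all ranks: chunk i of sendbuff goes to
// rank i, and chunk i of recvbuff is filled by rank i. Because ncclSend and
// ncclRecv are referenced here, inside libtorch_cuda.so, the linker resolves
// them from libnccl.a exactly once.
void all2all(
    const void* sendbuff,
    void* recvbuff,
    size_t count,        // elements per chunk
    size_t elem_size,    // bytes per element
    ncclDataType_t type,
    int nranks,
    ncclComm_t comm,
    cudaStream_t stream) {
  ncclGroupStart();  // fuse the point-to-point calls into one NCCL operation
  for (int peer = 0; peer < nranks; ++peer) {
    const char* src =
        static_cast<const char*>(sendbuff) + peer * count * elem_size;
    char* dst = static_cast<char*>(recvbuff) + peer * count * elem_size;
    ncclSend(src, count, type, peer, comm, stream);
    ncclRecv(dst, count, type, peer, comm, stream);
  }
  ncclGroupEnd();
}

} // namespace nccl
} // namespace cuda
} // namespace torch
```

`ProcessGroupNCCL.cpp` then calls `torch::cuda::nccl::all2all` instead of the raw `ncclSend`/`ncclRecv`, so no NCCL symbol is pulled into `libtorch_python.so`.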

Test Plan: Imported from OSS

Reviewed By: mingzhe09088

Differential Revision: D24138011

Pulled By: malfet

fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1


@dr-ci

dr-ci bot commented Dec 4, 2020

💊 CI failures summary and remediations

As of commit 445963c (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 2/3 non-CircleCI failure(s)

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


Extra GitHub checks: 1 failed


ci.pytorch.org: 1 failed



Summary:
Pull Request resolved: pytorch#45899

Use function polymorphism to avoid repeated casts.
That is, instead of using `NCCL_CHECK(from_nccl_result(`, add a variant of the function that takes `ncclResult_t` as an input argument.
Also add a non-pointer variant of `to_nccl_comm` to avoid the `*to_nccl_comm(&comm)` pattern.
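
A hedged sketch of what the two overloads might look like; the wrapper types (`ncclResult`, `WrapperComm`) and exact signatures are assumptions based on the description above:

```cpp
#include <nccl.h>

#include <stdexcept>

namespace torch {
namespace cuda {
namespace nccl {

// Wrapper-side result enum (stand-in for the real torch::cuda::nccl type).
enum class ncclResult { Success = 0, Error = 1 };

ncclResult from_nccl_result(ncclResult_t r) {
  return r == ncclSuccess ? ncclResult::Success : ncclResult::Error;
}

// Existing variant: takes the wrapper's result type.
void NCCL_CHECK(ncclResult status) {
  if (status != ncclResult::Success) {
    throw std::runtime_error("NCCL error");
  }
}

// New overload: takes ncclResult_t directly, so call sites can write
// NCCL_CHECK(ncclCommCount(comm, &n)) instead of
// NCCL_CHECK(from_nccl_result(ncclCommCount(comm, &n))).
void NCCL_CHECK(ncclResult_t status) {
  NCCL_CHECK(from_nccl_result(status));
}

// Opaque comm handle used by the wrapper's public header (hypothetical name).
struct WrapperComm;

// Pointer variant: call sites must write *to_nccl_comm(&comm).
ncclComm_t* to_nccl_comm(WrapperComm** comm) {
  return reinterpret_cast<ncclComm_t*>(comm);
}

// New non-pointer variant: call sites can simply write to_nccl_comm(comm).
ncclComm_t to_nccl_comm(WrapperComm* comm) {
  return reinterpret_cast<ncclComm_t>(comm);
}

} // namespace nccl
} // namespace cuda
} // namespace torch
```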

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24138012

Pulled By: malfet

fbshipit-source-id: 7f62a03e108cbe455910e86e894afdd1c27e8ff1
@github-actions
Contributor

github-actions bot commented Feb 3, 2021

Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark it as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
Stale pull requests will automatically be closed 30 days after being marked Stale.

@github-actions github-actions bot added the Stale label Feb 3, 2021
@github-actions github-actions bot closed this Apr 13, 2022

Labels

cla signed · oncall: distributed (add this issue/PR to distributed oncall triage queue) · Stale
