[Release/1.7] Enable NCCL A2A on OSS #48857
Closed
Summary:
Pull Request resolved: #45900
Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`.

Fixes #42517
Here is a NCCL dependency graph:
When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before #42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.

But adding `ncclSend`|`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into the library.

This PR adds the `nccl[Send|Recv]` calls to `torch_cuda.so` by implementing `all2all` in `torch_cuda`, and thus avoids double linking the static library.

A more involved, but less error-prone, solution would be to use wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.

Test Plan: Imported from OSS
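For readers unfamiliar with the exchange that `all2all` performs, here is a minimal single-process sketch of the data movement. It is not the code from this PR: `RankBuffers` and `simulated_all2all` are hypothetical names, and the plain copy stands in for the `ncclSend`/`ncclRecv` pair that each rank would issue per peer between `ncclGroupStart()` and `ncclGroupEnd()` in a real NCCL build.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one rank's buffers; not a PyTorch or NCCL type.
struct RankBuffers {
  std::vector<int> send;  // world_size chunks of `chunk` elements each
  std::vector<int> recv;  // same layout, filled by the exchange
};

// Simulates an all-to-all exchange among ranks living in one process.
// With NCCL, every rank would instead issue, for each peer, one ncclSend
// (its chunk destined for that peer) and one ncclRecv (the chunk coming
// from that peer), grouped between ncclGroupStart() and ncclGroupEnd().
void simulated_all2all(std::vector<RankBuffers>& ranks, std::size_t chunk) {
  const std::size_t world_size = ranks.size();
  for (std::size_t r = 0; r < world_size; ++r) {
    // Each rank must provide exactly one chunk per peer (including itself).
    assert(ranks[r].send.size() == world_size * chunk);
    ranks[r].recv.assign(world_size * chunk, 0);
  }
  for (std::size_t src = 0; src < world_size; ++src) {
    for (std::size_t dst = 0; dst < world_size; ++dst) {
      // Chunk `dst` of src's send buffer lands in chunk `src` of dst's
      // recv buffer.
      for (std::size_t i = 0; i < chunk; ++i) {
        ranks[dst].recv[src * chunk + i] = ranks[src].send[dst * chunk + i];
      }
    }
  }
}
```

To try it, fill each rank's `send` vector with `world_size * chunk` values and call `simulated_all2all(ranks, chunk)`; afterwards chunk `s` of rank `d`'s `recv` holds what rank `s` addressed to rank `d`. Keeping this logic behind a single function mirrors the idea in the description: the direct NCCL calls live in one library and callers only see the wrapper.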
Reviewed By: mingzhe09088
Differential Revision: D24138011
Pulled By: malfet
fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1