Make ProcessGroupNCCL load torch_ucc.so when TORCH_UCC_LIBRARY_PATH is set #69552
Conversation
pritamdamania87
left a comment
But this PR can be landed separately without waiting for facebookresearch/torch_ucc#56 because, in PyTorch's unit tests, UCC is never used or tested.
I do feel if we add some functionality to PyTorch, there should be tests covering this functionality. What would it take to add some tests for the UCC functionality?
@pritamdamania87 I agree that when we add new functionality, there should be tests. Actually, the majority of my work is to incrementally enable most…
With this plan, the first two PRs will NOT be tested. Does this plan sound good to you? If you prefer, I can merge the first three PRs into one. The PR will be larger and a bit more difficult to review, but it ensures everything added in it is tested.
static std::once_flag initialize_ucc_lib_flag;
std::call_once(initialize_ucc_lib_flag, [&]{
  ucc_lib_ = loadTorchUCC();
Should we just load UCC when CUDA/GPU is not available?
UCC will be used to handle CPU tensors for ProcessGroupNCCL, so no, even if CUDA/GPU is available, UCC will still be useful.
One question about UCC dependence. Does OSS UCC depend on NCCL? If so, when UCC is loaded through torch_ucc, is it possible that NCCL will also be loaded? Will this end up with two different versions of NCCL being loaded into PyTorch?
Yes, UCC supports using NCCL (as well as UCX and other libraries) as its transport. But in our case, we are not using this functionality. When we use UCC, we should be using the UCX backend and when we want to use NCCL, we will use it directly.
For the question of having multiple versions of NCCL, my understanding is:
If both PyTorch and UCC are dynamically linked to NCCL
The documentation of dlopen[1] says:
Symbol references in the shared object are resolved using (in order): symbols in the link map of objects loaded for the main program and its dependencies; symbols in shared objects (and their dependencies) that were previously opened with dlopen() using the RTLD_GLOBAL flag; and definitions in the shared object itself (and any dependencies that were loaded for that object).
Because PyTorch is always loaded earlier than torch_ucc, by the time torch_ucc is loaded, NCCL's symbols should already be available in the main program, so dlopen will just use those symbols for UCC.
If both PyTorch and UCC are statically linked to NCCL
For static linking, symbol resolution should happen at link time. So PyTorch will have its copy of NCCL code, and UCC will also have its copy. Each will use its own copy. The NCCL symbols in libucc.so will not be added to the main program for later symbol resolution because we dlopen it with RTLD_LOCAL, and according to the documentation[1]:
This is the converse of RTLD_GLOBAL, and the default if neither flag is specified. Symbols defined in this shared object are not made available to resolve references in subsequently loaded shared objects.
If PyTorch is statically linked to NCCL and UCC is dynamically linked to NCCL
For this case, PyTorch will use the NCCL that it is compiled with because symbol resolution happened at link time. I don't know which NCCL UCC will use, because I don't know if NCCL's symbols will be copied into PyTorch and added to the link map. If yes, then UCC will use the one from PyTorch. If not, then UCC will load its own NCCL and use the symbols there.
If PyTorch is dynamically linked to NCCL and UCC is statically linked to NCCL
PyTorch is always loaded earlier than torch_ucc, so PyTorch will load NCCL from the system and use that NCCL. But UCC will still use its own NCCL because all the NCCL function calls are already resolved when libucc.so is generated.
Reference: [1] dlopen(3), Linux manual page.
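To make the RTLD_LOCAL point above concrete, here is a minimal, illustrative sketch of opening a plugin such as torch_ucc.so with dlopen; the function name and error handling are my own and not the actual PyTorch loader code:

```cpp
#include <dlfcn.h>

#include <stdexcept>
#include <string>

// Open a plugin without exporting its symbols to the global lookup scope.
// RTLD_LOCAL keeps any NCCL copy bundled inside the plugin from being used
// to resolve references in shared objects loaded later.
void* openPluginLocal(const std::string& path) {
  void* handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    throw std::runtime_error(std::string("dlopen failed: ") + dlerror());
  }
  return handle;
}
```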
At a high level, is there a plan to introduce a cmake flag to control whether UCC is on or off?
I have resolved all review comments, including the cmake flag.
@mingzhe09088 @jiayisuse @pritamdamania87 Does this PR and the RFC #70654 look good to you?
CMakeLists.txt
Outdated
cmake_dependent_option(
  USE_C10D_NCCL "USE C10D NCCL" ON "USE_DISTRIBUTED;USE_NCCL" OFF)
cmake_dependent_option(
  USE_NCCL_WITH_UCC "Enable UCC support for ProcessGroupNCCL. Only available if USE_C10D_NCCL is on." ON
Do we really need this cmake flag if we are only conditionally loading the ucc lib at runtime anyway? What is the downside if we don't have this flag at all and default to true? Would there be some issues in the case where the ucc lib is not found?
I added this based on #69552 (comment). Right now, it might seem a bit redundant, but in the future, when ProcessGroupNCCL starts to automatically search and load torch_ucc.so from the user's system, having a cmake flag will be helpful. In my opinion, if we are sure we will need one in the future, it won't be bad to have it now, but I can remove this flag if you want.
#ifdef USE_NCCL_WITH_UCC
  static std::once_flag initialize_ucc_lib_flag;
  std::call_once(initialize_ucc_lib_flag, [&]{
Why do we need to pass & into the lambda? uccLib_ is static, so we shouldn't need & to access it in the lambda.
This is for rank_; otherwise we get:
/home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp: In lambda function:
/home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:599:32: error: ‘this’ was not captured for this lambda function
599 | LOG(INFO) << "[Rank " << rank_ << "] torch_ucc.so loaded";
| ^~~~~
/home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:599:32: error: invalid use of non-static data member ‘c10d::ProcessGroup::rank_’
In file included from /home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:13,
from /home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1:
/home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroup.hpp:431:13: note: declared here
431 | const int rank_;
| ^~~~~
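As a standalone illustration of the capture rule behind that error (hypothetical types, not the ProcessGroupNCCL code): a lambda can use static members without capturing anything, but touching a non-static member such as rank_ requires capturing this, which [&] does implicitly.

```cpp
#include <iostream>

struct Example {
  static int static_member;  // usable in a lambda without any capture
  int rank_ = 0;             // non-static: needs `this` to be captured

  void demo() {
    auto usesStatic = [] { std::cout << static_member << "\n"; };  // OK: no capture needed
    auto usesMember = [&] { std::cout << rank_ << "\n"; };         // OK: [&] captures `this`
    // auto broken = [] { std::cout << rank_ << "\n"; };           // error: 'this' was not captured
    usesStatic();
    usesMember();
  }
};

int Example::static_member = 42;

int main() {
  Example{}.demo();
  return 0;
}
```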
uccLib_ = loadTorchUCC();
if (uccLib_ != nullptr) {
  LOG(INFO) << "[Rank " << rank_ << "] torch_ucc.so loaded";
}
nit: In an else clause, should we log that torch_ucc.so was not loaded?
This is based on #69552 (comment)
This PR should be ready for review. The only thing I'm unsure about is whether we should keep or remove the compile-time flag. There are suggestions, such as in #70654 (comment), that a compile flag is not needed. I don't have a strong preference either way; it's totally up to Meta to decide :)
@jiayisuse has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Any update @jiayisuse?
namespace c10d {

std::shared_ptr<at::DynamicLibrary> loadTorchUCC() {
Let's make this interface inline? Otherwise it causes conflicting symbols at build time. Thanks!
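For context on the inline request, a generic header-only sketch with hypothetical names (not the actual PyTorch file): without inline, a function defined in a header and included from multiple translation units produces duplicate definitions at link time.

```cpp
// example_plugin_name.h -- hypothetical header, for illustration only
#pragma once
#include <string>

// Marked inline so that every .cpp file including this header shares a single
// definition; without `inline`, the linker would report conflicting symbols.
inline std::string examplePluginName() {
  return "torch_ucc.so";
}
```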
done
Thanks! Imported and doing another round of tests
@jiayisuse has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
…s set (#69552)

Summary: This is the very first step for the UCC-NCCL integration. This PR lets `ProcessGroupNCCL` load `torch_ucc.so` if the user specifies an environment variable `TORCH_UCC_LIBRARY_PATH`. If this environment variable is not specified by the user, then there will be no visible change.

In the future, we may want to make PyTorch smart enough to automatically detect the `torch_ucc.so` in the user's system, but before doing that, I believe we should first make sure that `ProcessGroupUCC` is very well tested.

Note that in this PR, `ProcessGroupNCCL` just loads the library but will not use it. I am trying to make PRs small, so the usage of `torch_ucc.so` will be submitted in later PRs.

This PR requires the change in facebookresearch/torch_ucc#56, otherwise `torch_ucc.so` cannot be successfully loaded. But this PR can be landed separately without waiting for facebookresearch/torch_ucc#56 because, in PyTorch's unit tests, UCC is never used or tested.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: #69552
Reviewed By: mruberry
Differential Revision: D34675212
Pulled By: jiayisuse
fbshipit-source-id: a3d1fb98340dbe3a931af555423863efd381f1ae
Hey @zasdfgbnm.
Summary: This PR is based on #69552, please review that PR first. This PR requires facebookresearch/torch_ucc#57, but this PR can be landed separately because in PyTorch's unit tests, UCC is never used or tested. cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Pull Request resolved: #69564 Reviewed By: mingzhe09088 Differential Revision: D35457276 Pulled By: jiayisuse fbshipit-source-id: 662c5a771c7cfc92ab42955c9abd24e56e5cafda
Summary: This PR is based on #69552, please review that PR first. This PR requires facebookresearch/torch_ucc#57, but this PR can be landed separately because in PyTorch's unit tests, UCC is never used or tested. cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Pull Request resolved: #69564 Reviewed By: mingzhe09088 Differential Revision: D35457276 Pulled By: jiayisuse fbshipit-source-id: 662c5a771c7cfc92ab42955c9abd24e56e5cafda (cherry picked from commit 985b5f8)
This is the very first step for the UCC-NCCL integration. This PR lets `ProcessGroupNCCL` load `torch_ucc.so` if the user specifies an environment variable `TORCH_UCC_LIBRARY_PATH`. If this environment variable is not specified by the user, then there will be no visible change.

In the future, we may want to make PyTorch smart enough to automatically detect the `torch_ucc.so` in the user's system, but before doing that, I believe we should first make sure that `ProcessGroupUCC` is very well tested.

Note that in this PR, `ProcessGroupNCCL` just loads the library but will not use it. I am trying to make PRs small, so the usage of `torch_ucc.so` will be submitted in later PRs.

This PR requires the change in facebookresearch/torch_ucc#56, otherwise `torch_ucc.so` cannot be successfully loaded. But this PR can be landed separately without waiting for facebookresearch/torch_ucc#56 because, in PyTorch's unit tests, UCC is never used or tested.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang
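For readers skimming the thread, a minimal sketch of the behavior described above, assuming only the environment-variable contract; the helper name, loading call, and logging here are illustrative, not the merged implementation. The library is loaded once, and only when `TORCH_UCC_LIBRARY_PATH` is set.

```cpp
#include <dlfcn.h>

#include <cstdlib>
#include <iostream>
#include <mutex>

// Load the library named by TORCH_UCC_LIBRARY_PATH exactly once; if the
// environment variable is unset, do nothing, so behavior is unchanged.
void* loadTorchUccIfRequested() {
  static void* handle = nullptr;
  static std::once_flag flag;
  std::call_once(flag, [] {
    const char* path = std::getenv("TORCH_UCC_LIBRARY_PATH");
    if (path == nullptr) {
      return;  // env var not set: nothing is loaded
    }
    handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle != nullptr) {
      std::cout << "torch_ucc.so loaded from " << path << "\n";
    } else {
      std::cerr << "failed to load " << path << ": " << dlerror() << "\n";
    }
  });
  return handle;
}
```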