
Conversation

@zasdfgbnm
Collaborator

@zasdfgbnm zasdfgbnm commented Dec 7, 2021

This is the very first step for the UCC-NCCL integration. This PR lets ProcessGroupNCCL load torch_ucc.so if the user specifies the environment variable TORCH_UCC_LIBRARY_PATH. If this environment variable is not set, there is no visible change.

In the future, we may want to make PyTorch smart enough to automatically detect torch_ucc.so on the user's system, but before doing that, I believe we should first make sure that ProcessGroupUCC is very well tested.

Note that in this PR, ProcessGroupNCCL just loads the library but does not use it. I am trying to keep PRs small, so the usage of torch_ucc.so will be submitted in later PRs.

This PR requires the change in facebookresearch/torch_ucc#56; otherwise torch_ucc.so cannot be successfully loaded. But this PR can be landed separately without waiting for facebookresearch/torch_ucc#56 because UCC is never used or tested in PyTorch's unit tests.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

@facebook-github-bot
Contributor

facebook-github-bot commented Dec 7, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 0e710a7 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Dec 7, 2021
@pytorch-probot

pytorch-probot bot commented Dec 7, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/214d33a509349431f4e80b4720c34377687fba18/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-binary-conda ciflow/binaries, ciflow/binaries/conda 🚫 skipped
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-manywheel ciflow/binaries, ciflow/binaries/wheel 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-bionic-py3.6-clang9 ciflow/xla 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

Contributor

@pritamdamania87 pritamdamania87 left a comment


But this PR can be landed separately without waiting for facebookresearch/torch_ucc#56 because, in PyTorch's unit tests, UCC is never used or tested.

I do feel if we add some functionality to PyTorch, there should be tests covering this functionality. What would it take to add some tests for the UCC functionality?

@zasdfgbnm
Collaborator Author

@pritamdamania87 I agree that when we add new functionality, there should be tests. Actually, the majority of my work is to incrementally enable most torch.distributed tests on UCC. My plan is:

  • This is the first PR, it just adds the code that loads torch_ucc.so and does nothing more.
  • In the second PR, I will do a bit more: this time, ProcessGroupNCCL will create an instance of ProcessGroupUCC and store its pointer. But this pointer is not used.
  • In the third PR, I will make ProcessGroupNCCL dispatch to ProcessGroupUCC on CPU for send, recv, and recvAnysource, enable unit tests for them on CPU, and modify the CI pipelines to install torch_ucc.so so that these tests are run.
  • After the third PR, more functionality and tests will be incrementally enabled, and every PR will be tested.

With this plan, the first two PRs will NOT be tested. Does this plan sound good to you? If you prefer, I can merge the first three PRs into one. The PR will be larger and a bit more difficult to review, but it ensures that everything added in it is tested.

@mrshenli mrshenli added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Dec 8, 2021

static std::once_flag initialize_ucc_lib_flag;
std::call_once(initialize_ucc_lib_flag, [&]{
ucc_lib_ = loadTorchUCC();
Contributor

Should we just load UCC when CUDA/GPU is not available?

Collaborator Author

UCC will be used to handle CPU tensors for ProcessGroupNCCL, so no, even if CUDA/GPU is available, UCC will still be useful.

Contributor

One question about the UCC dependency. Does OSS UCC depend on NCCL? If so, when UCC is loaded through torch_ucc, is it possible that NCCL will also be loaded? Will this end up with two different versions of NCCL being loaded into PyTorch?

Collaborator Author

Yes, UCC supports using NCCL (as well as UCX and other libraries) as its transport. But in our case, we are not using this functionality. When we use UCC, we will be using the UCX backend, and when we want to use NCCL, we will use it directly.

For the question of having multiple versions of NCCL, my understanding is:

If both PyTorch and UCC are dynamically linked to NCCL

The documentation of dlopen[1] says:

Symbol references in the shared object are resolved using (in order): symbols in the link map of objects loaded for the main program and its dependencies; symbols in shared objects (and their dependencies) that were previously opened with dlopen() using the RTLD_GLOBAL flag; and definitions in the shared object itself (and any dependencies that were loaded for that object).

Because PyTorch is always loaded earlier than torch_ucc, by the time torch_ucc is loaded, symbols of NCCL should have already been available in the main program. So dlopen will just use these symbols for UCC.

If both PyTorch and UCC are statically linked to NCCL

For static linking, symbol resolution should happen at link time. So PyTorch will have its copy of NCCL code, and UCC will also have its copy. Each will use its own copy. The NCCL symbols in libucc.so will not be added to the main program for later symbol resolution because we dlopen it with RTLD_LOCAL, and according to the documentation[1]:

This is the converse of RTLD_GLOBAL, and the default if neither flag is specified. Symbols defined in this shared object are not made available to resolve references in subsequently loaded shared objects.

If PyTorch is statically linked to NCCL and UCC is dynamically linked to NCCL

For this case, PyTorch will use the NCCL it is compiled with because symbol resolution happened at link time. I don't know which NCCL UCC will use, because I don't know whether NCCL's symbols will be copied into PyTorch and added to the link map. If yes, then UCC will use the one from PyTorch. If not, then UCC will load its own NCCL and use the symbols there.

If PyTorch is dynamically linked to NCCL and UCC is statically linked to NCCL

PyTorch is always loaded earlier than torch_ucc, so PyTorch will load NCCL from the system and use that NCCL. But UCC will still use its own NCCL because all the NCCL function calls are already resolved when libucc.so is generated.

Reference

[1] dlopen(3), Linux Programmer's Manual

@mingzhe09088
Contributor

At a high level, is there a plan to introduce a cmake flag to control whether UCC is on or off?

@zasdfgbnm
Collaborator Author

I have resolved all review comments, including the cmake flag.

@zasdfgbnm
Collaborator Author

@mingzhe09088 @jiayisuse @pritamdamania87 Does this PR and the RFC #70654 look good to you?

CMakeLists.txt Outdated
cmake_dependent_option(
USE_C10D_NCCL "USE C10D NCCL" ON "USE_DISTRIBUTED;USE_NCCL" OFF)
cmake_dependent_option(
USE_NCCL_WITH_UCC "Enable UCC support for ProcessGroupNCCL. Only available if USE_C10D_NCCL is on." ON
Contributor

Do we really need this cmake flag if we are anyways only conditionally loading the ucc lib at runtime? What is the downside if we don't have this flag at all and default to true? Would there be some issues in the case where ucc lib is not found?

Collaborator Author

I added this based on #69552 (comment). Right now it might seem a bit redundant, but in the future, when ProcessGroupNCCL starts to automatically search for and load torch_ucc.so from the user's system, having a cmake flag will be helpful. In my opinion, if we are sure we will need one in the future, it won't hurt to have it now, but I can remove this flag if you want.


#ifdef USE_NCCL_WITH_UCC
static std::once_flag initialize_ucc_lib_flag;
std::call_once(initialize_ucc_lib_flag, [&]{
Contributor

Why do we need to pass & into the lambda? uccLib_ is static and we shouldn't need & to access it in the lambda?

Collaborator Author

@zasdfgbnm zasdfgbnm Jan 10, 2022

This is for rank_, otherwise we will have

/home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp: In lambda function:
/home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:599:32: error: ‘this’ was not captured for this lambda function
  599 |       LOG(INFO) << "[Rank " << rank_  << "] torch_ucc.so loaded";
      |                                ^~~~~
/home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:599:32: error: invalid use of non-static data member ‘c10d::ProcessGroup::rank_’
In file included from /home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:13,
                 from /home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1:
/home/gaoxiang/pytorch-ucc2/torch/csrc/distributed/c10d/ProcessGroup.hpp:431:13: note: declared here
  431 |   const int rank_;
      |             ^~~~~

uccLib_ = loadTorchUCC();
if (uccLib_ != nullptr) {
LOG(INFO) << "[Rank " << rank_ << "] torch_ucc.so loaded";
}
Contributor

nit: In an else clause, should we log that torch_ucc.so was not loaded?

Collaborator Author

This is based on #69552 (comment)

@zasdfgbnm
Collaborator Author

This PR should be ready for review. The only thing I am unsure about is whether we should keep or remove the compile-time flag. There are suggestions, such as #70654 (comment), that a compile flag is not needed. I don't have a strong preference either way; it's totally up to Meta to decide :)

@facebook-github-bot
Contributor

@jiayisuse has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@zasdfgbnm
Collaborator Author

zasdfgbnm commented Mar 14, 2022

Any update @jiayisuse ?


namespace c10d {

std::shared_ptr<at::DynamicLibrary> loadTorchUCC() {
Contributor

Let's make this interface inline? Otherwise it causes conflicting symbols at build time. Thanks!

Collaborator Author

done

Contributor

Thanks! Imported and doing another round of tests

@facebook-github-bot
Contributor

@jiayisuse has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Mar 25, 2022
…s set (#69552)

Summary:
This is the very first step for the UCC-NCCL integration. This PR lets `ProcessGroupNCCL` load `torch_ucc.so` if the user specifies the environment variable `TORCH_UCC_LIBRARY_PATH`. If this environment variable is not set, there is no visible change.

In the future, we may want to make PyTorch smart enough to automatically detect `torch_ucc.so` on the user's system, but before doing that, I believe we should first make sure that `ProcessGroupUCC` is very well tested.

Note that in this PR, `ProcessGroupNCCL` just loads the library but does not use it. I am trying to keep PRs small, so the usage of `torch_ucc.so` will be submitted in later PRs.

This PR requires the change in facebookresearch/torch_ucc#56; otherwise `torch_ucc.so` cannot be successfully loaded. But this PR can be landed separately without waiting for facebookresearch/torch_ucc#56 because UCC is never used or tested in PyTorch's unit tests.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: #69552

Reviewed By: mruberry

Differential Revision: D34675212

Pulled By: jiayisuse

fbshipit-source-id: a3d1fb98340dbe3a931af555423863efd381f1ae
@github-actions
Contributor

Hey @zasdfgbnm.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@zasdfgbnm zasdfgbnm deleted the load-ucc.so branch March 25, 2022 19:26
facebook-github-bot pushed a commit that referenced this pull request Apr 21, 2022
Summary:
This PR is based on #69552, please review that PR first. This PR requires facebookresearch/torch_ucc#57, but this PR can be landed separately because in PyTorch's unit tests, UCC is never used or tested.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: #69564

Reviewed By: mingzhe09088

Differential Revision: D35457276

Pulled By: jiayisuse

fbshipit-source-id: 662c5a771c7cfc92ab42955c9abd24e56e5cafda
malfet pushed a commit that referenced this pull request Apr 22, 2022
Summary:
This PR is based on #69552, please review that PR first. This PR requires facebookresearch/torch_ucc#57, but this PR can be landed separately because in PyTorch's unit tests, UCC is never used or tested.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: #69564

Reviewed By: mingzhe09088

Differential Revision: D35457276

Pulled By: jiayisuse

fbshipit-source-id: 662c5a771c7cfc92ab42955c9abd24e56e5cafda
(cherry picked from commit 985b5f8)