Skip to content

Conversation

@pritamdamania87
Copy link
Contributor

@pritamdamania87 pritamdamania87 commented Mar 2, 2022

Stack from ghstack:

This PR adds logging to track usage of ProcessGroup backends. The
change involves adding an init() method to the ProcessGroup base class and
modifying all the subclasses to call it.

We can't add logging directly in the ProcessGroup constructor since
getBackendName() is pure virtual and can't be called from the base class
constructor. See
https://isocpp.org/wiki/faq/strange-inheritance#calling-virtuals-from-ctors for
more details.

Differential Revision: D34596295

This PR adds logging to track usage of ProcessGroup backends. The
change involves adding an `init()` method to the ProcessGroup base class and
modifying all the subclasses to call it.

We can't add logging directly in the ProcessGroup constructor since
`getBackendName()` is pure virtual and can't be called from the base class
constructor. See
https://isocpp.org/wiki/faq/strange-inheritance#calling-virtuals-from-ctors for
more details.

Differential Revision: [D34596295](https://our.internmc.facebook.com/intern/diff/D34596295/)

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Mar 2, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/672e75e6290c70bc5b74745729340000ed1fe965/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-manywheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build ciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
macos-arm64-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-arm64-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
macos-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
windows-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla 🚫 skipped

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Mar 2, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 672e75e (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-xenial-py3.7-clang7-asan / test (default, 1, 3, linux.2xlarge) (1/1)

Step: "Unknown" (full log | diagnosis details | 🔁 rerun)

2022-03-02T23:55:20.4525630Z SUMMARY: Undefined.../jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in
2022-03-02T23:55:20.3994564Z     #10 0x5622745c7801 in run_mod /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:1037
2022-03-02T23:55:20.3995120Z     #11 0x5622745d27a9 in PyRun_StringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:961
2022-03-02T23:55:20.3995515Z     #12 0x5622745d280b in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1627392990942/work/Python/pythonrun.c:455
2022-03-02T23:55:20.3997588Z     #13 0x5622745d2908 in pymain_run_command /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:420
2022-03-02T23:55:20.3998348Z     #14 0x5622745d2908 in pymain_run_python /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:2907
2022-03-02T23:55:20.3998836Z     #15 0x5622745d2908 in pymain_main /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3460
2022-03-02T23:55:20.3999161Z     #16 0x5622745d2ccb in _Py_UnixMain /tmp/build/80754af9/python_1627392990942/work/Modules/main.c:3495
2022-03-02T23:55:20.4524194Z     #17 0x7f4d8e3b183f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
2022-03-02T23:55:20.4524889Z     #18 0x562274577554 in _start (/opt/conda/bin/python3.7+0x1d7554)
2022-03-02T23:55:20.4525130Z 
2022-03-02T23:55:20.4525630Z SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:20:3 in 
2022-03-02T23:55:20.4840163Z + retcode=1
2022-03-02T23:55:20.4840702Z + set -e
2022-03-02T23:55:20.4840956Z + return 1
2022-03-02T23:55:20.4844364Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX-* ]]
2022-03-02T23:55:20.4844845Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X ]]
2022-03-02T23:55:20.4845424Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX2-* ]]
2022-03-02T23:55:20.4845920Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]]
2022-03-02T23:55:20.4846526Z + [[ linux-xenial-py3.7-clang7-asan-default == *-NO_AVX512-* ]]
2022-03-02T23:55:20.4847016Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\5\1\2 ]]
2022-03-02T23:55:20.4849921Z + [[ linux-xenial-py3.7-clang7-asan-default == *tbb* ]]

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Mar 2, 2022
pritamdamania87 pushed a commit that referenced this pull request Mar 2, 2022
This PR adds logging to track usage of ProcessGroup backends. The
change involves adding an `init()` method to the ProcessGroup base class and
modifying all the subclasses to call it.

We can't add logging directly in the ProcessGroup constructor since
`getBackendName()` is pure virtual and can't be called from the base class
constructor. See
https://isocpp.org/wiki/faq/strange-inheritance#calling-virtuals-from-ctors for
more details.

Differential Revision: [D34596295](https://our.internmc.facebook.com/intern/diff/D34596295/)

ghstack-source-id: 150381852
Pull Request resolved: #73702
protected:
// Implementations of this interface need to call this to setup
// appropriate logging etc.
void init();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have a more descriptive name for this function. init sounds a bit confusing since the whole purpose of the base constructor is to initialize the object. How about record_usage(), or something along the line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I was using the two-phase initialization pattern described here: https://isocpp.org/wiki/faq/strange-inheritance#calling-virtuals-from-ctor-idiom. Although, currently the init() method only does API usage logging. It can be repurposed to do additional initialization as well later. Otherwise we would end up with multiple methods like record_usage(), record_foo(), record_bar()` etc. that all need to be called by the subclasses

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I agree that the name record_usage() (deliberately) reduces the scope of the function, I would consider it a design smell if we needed to have further logic in our constructor that depends on virtual calls. Even in this particular case the cleaner, but BC-breaking, solution would be to take the backend name as a constructor parameter.

Having said that I don't have a strong objection about the name init(). One thing that we should mind though is if we want to preserve BC-compatibility, and consider init() as a place where we can potentially add further initialization logic, we have to make sure that all ProcessGroup implementations out there are updated to call it. Otherwise it will silently break implementations that don't call init().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I agree that the name record_usage() (deliberately) reduces the scope of the function, I would consider it a design smell if we needed to have further logic in our constructor that depends on virtual calls. Even in this particular case the cleaner, but BC-breaking, solution would be to take the backend name as a constructor parameter.

I agree with this and if we were building this from scratch we wouldn't implement it this way. But the tricky thing is to do this in a backward compatible way. The solution in this PR is already a workaround and is probably the best way I know off to achieve this logging. The reason I chose init() was because I worry about the fact of having to do something like this a couple of more times.

Having said that I don't have a strong objection about the name init(). One thing that we should mind though is if we want to preserve BC-compatibility, and consider init() as a place where we can potentially add further initialization logic, we have to make sure that all ProcessGroup implementations out there are updated to call it. Otherwise it will silently break implementations that don't call init().

I did update a bunch of ProcessGroups in our repo already. Once this lands, I can probably create a PR for other repos. But there probably will be many ProcessGroups out there that we don't know about and hence I left a comment in our docs as a best effort mechanism.

facebook-github-bot pushed a commit that referenced this pull request Mar 8, 2022
Summary:
Pull Request resolved: #73702

This PR adds logging to track usage of ProcessGroup backends. The
change involves adding an `init()` method to the ProcessGroup base class and
modifying all the subclasses to call it.

We can't add logging directly in the ProcessGroup constructor since
`getBackendName()` is pure virtual and can't be called from the base class
constructor. See
https://isocpp.org/wiki/faq/strange-inheritance#calling-virtuals-from-ctors for
more details.
ghstack-source-id: 150381852

Test Plan: check logging

Reviewed By: cbalioglu

Differential Revision: D34596295

fbshipit-source-id: 5baa7bfb53bdcd1b4032292c4e7715467f983288
@github-actions
Copy link
Contributor

github-actions bot commented Mar 8, 2022

Hey @pritamdamania87.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 9, 2022
Summary:
Pull Request resolved: pytorch/pytorch#73702

This PR adds logging to track usage of ProcessGroup backends. The
change involves adding an `init()` method to the ProcessGroup base class and
modifying all the subclasses to call it.

We can't add logging directly in the ProcessGroup constructor since
`getBackendName()` is pure virtual and can't be called from the base class
constructor. See
https://isocpp.org/wiki/faq/strange-inheritance#calling-virtuals-from-ctors for
more details.
ghstack-source-id: 150381852

Test Plan: check logging

Reviewed By: cbalioglu

Differential Revision: D34596295

fbshipit-source-id: 5baa7bfb53bdcd1b4032292c4e7715467f983288
(cherry picked from commit b2344144fe0c25db02c2e958f2acd772e9cd4dc8)
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 9, 2022
Summary:
Pull Request resolved: pytorch/pytorch#73702

This PR adds logging to track usage of ProcessGroup backends. The
change involves adding an `init()` method to the ProcessGroup base class and
modifying all the subclasses to call it.

We can't add logging directly in the ProcessGroup constructor since
`getBackendName()` is pure virtual and can't be called from the base class
constructor. See
https://isocpp.org/wiki/faq/strange-inheritance#calling-virtuals-from-ctors for
more details.
ghstack-source-id: 150381852

Test Plan: check logging

Reviewed By: cbalioglu

Differential Revision: D34596295

fbshipit-source-id: 5baa7bfb53bdcd1b4032292c4e7715467f983288
(cherry picked from commit b2344144fe0c25db02c2e958f2acd772e9cd4dc8)
@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/249/head branch March 12, 2022 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants