@rohan-varma rohan-varma commented Mar 11, 2022

Stack from ghstack (oldest at bottom):

Check for a mismatch in the number of parameters by broadcasting rank 0's parameter count and verifying it on every rank. As a result, non-zero ranks raise an error when the number of parameters differs across ranks.

Closes #73547

Differential Revision: D34772067
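
The check described above can be illustrated with a small simulation. This is only a sketch: it does not use `torch.distributed`, the function name `ranks_failing_broadcast_check` is hypothetical, and a plain list stands in for the per-rank parameter counts. It shows that only ranks whose count differs from rank 0's would raise.

```python
from typing import List


def ranks_failing_broadcast_check(param_counts: List[int]) -> List[int]:
    """Simulate the broadcast-based check.

    param_counts[r] is the number of parameters on rank r (hypothetical input).
    Returns the ranks that would raise an error.
    """
    if not param_counts:
        return []
    # "Broadcast": every rank receives rank 0's parameter count.
    rank0_count = param_counts[0]
    # Each rank compares its local count with the broadcast value.
    return [rank for rank, count in enumerate(param_counts) if count != rank0_count]


# Rank 2 has two extra parameters, so only rank 2 would raise.
print(ranks_failing_broadcast_check([10, 10, 12, 10]))  # prints [2]
```

Note that rank 0 can never fail this check, because it compares its count with itself.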

@facebook-github-bot added the "oncall: distributed" label (Add this issue/PR to distributed oncall triage queue) Mar 11, 2022
rohan-varma added a commit that referenced this pull request Mar 11, 2022
Pull Request resolved: #74113

Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks.

Closes #73547
ghstack-source-id: 151159056

Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/)
rohan-varma added a commit that referenced this pull request Mar 11, 2022
Pull Request resolved: #74113

Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks.

Closes #73547
ghstack-source-id: 151191152

Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/)
rohan-varma added a commit that referenced this pull request Mar 14, 2022
Pull Request resolved: #74113

Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks.

Closes #73547
ghstack-source-id: 151275647

Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/)
@mrshenli mrshenli left a comment

Stamp to unblock


// Broadcast and verify parameter size.
std::vector<at::Tensor> param_size_vec{param_size_tensor};
process_group->broadcast(param_size_vec)->wait();

Contributor

nit (can be done as future BE work item): DDP ctor will broadcast param values anyway. Is it possible to implement these two broadcasts in the same location using two async ops? I recall the param value broadcast was done in the distributed.py file?

Contributor Author

Yes, there are actually three collectives that run now:

  1. parameter size verification
  2. parameter shape verification
  3. parameter broadcast

We could make all three async, wait on them together, and register then() callbacks on the first two to raise the appropriate errors. Filed #74185.
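
As a rough illustration of that idea (not the c10d API), the sketch below uses Python's `concurrent.futures` as a stand-in: `add_done_callback` plays the role of `then()`, and the three callables are hypothetical placeholders for the size check, the shape check, and the parameter broadcast. The function name `run_startup_collectives` is also an assumption.

```python
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import Callable, List


def run_startup_collectives(
    verify_size: Callable[[], None],
    verify_shapes: Callable[[], None],
    broadcast_params: Callable[[], None],
) -> List[str]:
    """Launch three stand-in 'collectives' concurrently and collect errors."""
    errors: List[str] = []

    def record_error(name: str) -> Callable[[Future], None]:
        # Callback that runs once the future finishes (analogous to then()).
        def callback(fut: Future) -> None:
            exc = fut.exception()
            if exc is not None:
                errors.append(f"{name}: {exc}")

        return callback

    with ThreadPoolExecutor(max_workers=3) as pool:
        size_fut = pool.submit(verify_size)
        shape_fut = pool.submit(verify_shapes)
        bcast_fut = pool.submit(broadcast_params)
        # Only the two verification steps get error-reporting callbacks.
        size_fut.add_done_callback(record_error("parameter size verification"))
        shape_fut.add_done_callback(record_error("parameter shape verification"))
        # Wait on all three together instead of one after another.
        wait([size_fut, shape_fut, bcast_fut])
    # Leaving the 'with' block joins the workers, so all callbacks have run.
    return errors
```

In this sketch the errors are collected and returned rather than raised, so the caller decides how to surface them.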

std::vector<at::Tensor> param_size_vec{param_size_tensor};
process_group->broadcast(param_size_vec)->wait();
auto res = param_size_tensor[0].item<int>();
TORCH_CHECK(

Contributor

Curious: does this mean some DDP processes will crash while others won't, if their size matches? If so, would allgather be better, since all processes could then make the same decision and throw the same error?

Contributor Author

This is a good point. It would be a lot better if all processes made the same decision. The next update will fix this.
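
A minimal simulation of the allgather idea, under the assumption that every rank receives the full list of counts and applies the same predicate to it. It does not use `torch.distributed`; the function name `allgather_check` and the error text are hypothetical.

```python
from typing import List, Optional


def allgather_check(param_counts: List[int]) -> List[Optional[str]]:
    """Simulate an allgather-based check.

    param_counts[r] is the number of parameters on rank r (hypothetical input).
    Returns, per rank, the error message it would raise, or None if it passes.
    """
    # "Allgather": every rank ends up holding the full list of counts.
    gathered = list(param_counts)
    # Every rank evaluates the same condition on the same data,
    # so either all ranks pass or all ranks fail with the same message.
    if len(set(gathered)) <= 1:
        return [None] * len(param_counts)
    message = f"parameter count mismatch across ranks: {gathered}"
    return [message for _ in param_counts]


results = allgather_check([10, 10, 12])
print(len(set(results)))  # prints 1: all ranks agree on the outcome
```

Unlike a broadcast from rank 0, this makes rank 0 fail along with the others when any rank disagrees.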

group_gloo = dist.new_group(
timeout=timedelta(seconds=60), backend=dist.Backend.GLOO
)
# Set NCCL_BLOCKING_WAIT and use a new NCCL group to improve test

Contributor

Curious: why does this help improve determinism?

Contributor Author

We'd like NCCL_BLOCKING_WAIT so that rank 0 doesn't get taken down by async error handling. Async error handling is non-deterministic in when it kicks in and runs, and it also takes down the process, which is hard to accommodate in a unit test.

As a result, we also need a new NCCL group, because the default NCCL group is initialized during setup, before this test executes. I'll add some clarifying comments.
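
A sketch of how a test might scope that setting, assuming (as the reply implies) that the variable only affects NCCL groups created after it is set. The helper name `nccl_blocking_wait` is hypothetical, and the group creation is shown only as a comment because it requires `torch.distributed` with NCCL.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def nccl_blocking_wait() -> Iterator[None]:
    """Temporarily set NCCL_BLOCKING_WAIT=1, restoring the old value on exit."""
    previous = os.environ.get("NCCL_BLOCKING_WAIT")
    os.environ["NCCL_BLOCKING_WAIT"] = "1"
    try:
        yield
    finally:
        # Put the environment back so other tests are unaffected.
        if previous is None:
            os.environ.pop("NCCL_BLOCKING_WAIT", None)
        else:
            os.environ["NCCL_BLOCKING_WAIT"] = previous


# Hypothetical usage inside a test (needs torch.distributed and NCCL):
# with nccl_blocking_wait():
#     group_nccl = dist.new_group(
#         backend=dist.Backend.NCCL, timeout=timedelta(seconds=60)
#     )
```

The new group has to be created inside the `with` block; the default group from test setup already exists and would not pick up the change.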

rohan-varma added a commit that referenced this pull request Mar 14, 2022
Pull Request resolved: #74113

Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks.

Closes #73547
ghstack-source-id: 151319259

Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/)
facebook-github-bot pushed a commit that referenced this pull request Mar 15, 2022
Summary:
Pull Request resolved: #74113

Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks.

Closes #73547
ghstack-source-id: 151319259

Test Plan: UT

Reviewed By: mrshenli

Differential Revision: D34772067

fbshipit-source-id: 456933111e9996823f1a220b474998e17fb74210
@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/520/head branch March 19, 2022 14:17