@fduwjj fduwjj commented Feb 25, 2022

Stack from ghstack (oldest at bottom):

Implement `_clip_grad_norm_` for FSDP, issue: #72548

Differential Revision: D34230605
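
For context on why a sharded wrapper needs its own clipping routine: each rank holds only a shard of the gradients, so a norm computed locally is not the norm of the full gradient. The sketch below is not the code from this PR. It simulates ranks with plain Python lists, and the helper names (`local_norm_contribution`, `global_norm`, `clip_sharded_grads`) and the `eps` value are assumptions made for illustration. It shows one common way to do it: combine per-shard contributions (a sum of p-th powers, or a max for the infinity norm), then scale every shard by the same coefficient.

```python
import math


def local_norm_contribution(shard_grads, norm_type):
    """Per-rank piece of the norm: max |g| for the inf-norm, else sum of |g|^p."""
    if not shard_grads:
        return 0.0
    if math.isinf(norm_type):
        return max(abs(g) for g in shard_grads)
    return sum(abs(g) ** norm_type for g in shard_grads)


def global_norm(contributions, norm_type):
    """Stand-in for an all-reduce: MAX for the inf-norm, SUM then p-th root otherwise."""
    if math.isinf(norm_type):
        return max(contributions)
    return sum(contributions) ** (1.0 / norm_type)


def clip_sharded_grads(shards, max_norm, norm_type=2.0, eps=1e-6):
    """Scale every shard by one shared coefficient so the global norm is <= max_norm."""
    contributions = [local_norm_contribution(s, norm_type) for s in shards]
    total_norm = global_norm(contributions, norm_type)
    # Never scale gradients up; only shrink them when the norm exceeds max_norm.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * clip_coef for g in s] for s in shards]
    return clipped, total_norm


# Two simulated ranks, each holding one shard of the flattened gradient.
shards = [[3.0, 0.0], [0.0, 4.0]]
clipped, total_norm = clip_sharded_grads(shards, max_norm=1.0)
```

Here the per-shard 2-norms are 3 and 4, while the norm that clipping must use is 5. Clipping each shard against its own local norm would scale the two shards by different factors and change the gradient direction.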

facebook-github-bot commented Feb 25, 2022

💊 CI failures summary and remediations

As of commit a87e2f9 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


pytorch-bot bot commented Feb 25, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/fe9a43c47fb3e25d39ab87edcb46a4180500cb54/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-manywheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
macos-arm64-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-arm64-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
macos-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
windows-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla 🚫 skipped

@facebook-github-bot facebook-github-bot added the `cla signed` and `oncall: distributed` (Add this issue/PR to distributed oncall triage queue) labels Feb 25, 2022
fduwjj added a commit that referenced this pull request Feb 25, 2022
Implement the `_clip_grad_norm_` for FSDP, issue: #72548

Differential Revision: [D34230605](https://our.internmc.facebook.com/intern/diff/D34230605/)

ghstack-source-id: 149925805
Pull Request resolved: #73405
@rohan-varma rohan-varma left a comment

This is awesome work, thanks so much for persisting through all the usability issues and giving feedback on them offline! Added some comments for your consideration.

@zhaojuanmao zhaojuanmao left a comment

Nice work! Thanks for adding this.

For tests, would you please add a `norm_type` config to `_train_for_several_steps()` in `common_fsdp.py` and compare DDP vs. FSDP parity? This would be similar to the fairscale tests `test_clip_norm_transformer` and `test_mixture_of_experts_grad_clip_breaks`, which go through `_train_for_several_steps` in `test_fsdp.py`.
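
The parity idea in this request can be illustrated without any distributed setup. The sketch below is only an assumption-laden toy: it does not use `_train_for_several_steps()` or anything from `common_fsdp.py`, and `check_parity`, `p_norm`, and `clip` are hypothetical helpers. It clips one gradient vector in a reference path that sees the full gradient (DDP-like) and in a sharded path that sees chunks (FSDP-like), then checks that both agree for several `norm_type` values.

```python
import math


def p_norm(values, norm_type):
    """p-norm of a flat list; max |v| when norm_type is infinity."""
    if math.isinf(norm_type):
        return max(abs(v) for v in values)
    return sum(abs(v) ** norm_type for v in values) ** (1.0 / norm_type)


def clip(grads, total_norm, max_norm, eps=1e-6):
    """Scale grads by min(1, max_norm / (total_norm + eps))."""
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads]


def check_parity(full_grads, world_size, max_norm, norm_type):
    # Reference path (DDP-like): every rank sees the full gradient.
    reference = clip(full_grads, p_norm(full_grads, norm_type), max_norm)

    # Sharded path (FSDP-like): each simulated rank sees one contiguous chunk.
    chunk = math.ceil(len(full_grads) / world_size)
    shards = [full_grads[i:i + chunk] for i in range(0, len(full_grads), chunk)]
    if math.isinf(norm_type):
        # Stand-in for an all-reduce with MAX.
        total = max(p_norm(s, norm_type) for s in shards)
    else:
        # Stand-in for an all-reduce with SUM over per-shard sums of |g|^p.
        total = sum(sum(abs(v) ** norm_type for v in s) for s in shards) ** (1.0 / norm_type)
    sharded = [g for s in shards for g in clip(s, total, max_norm)]

    return len(reference) == len(sharded) and all(
        math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
        for a, b in zip(reference, sharded)
    )


grads = [0.5, -2.0, 3.0, 1.5, -4.0, 0.25]
results = {
    nt: check_parity(grads, world_size=3, max_norm=1.0, norm_type=nt)
    for nt in (1.0, 2.0, float("inf"))
}
```

A real test would compare model parameters after several training steps rather than a single hand-written gradient, but the parametrization over `norm_type` follows the same pattern.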

fduwjj added a commit that referenced this pull request Mar 7, 2022
Pull Request resolved: #73405

Implement the `_clip_grad_norm_` for FSDP, issue: #72548
ghstack-source-id: 150655449

Differential Revision: [D34230605](https://our.internmc.facebook.com/intern/diff/D34230605/)
fduwjj commented Mar 7, 2022

Addressed the reviewers' comments.

fduwjj added a commit that referenced this pull request Mar 9, 2022
Pull Request resolved: #73405

Implement the `_clip_grad_norm_` for FSDP, issue: #72548
ghstack-source-id: 150951935

Differential Revision: [D34230605](https://our.internmc.facebook.com/intern/diff/D34230605/)
fduwjj commented Mar 9, 2022

Added test coverage for the nested model.

@fduwjj fduwjj requested a review from rohan-varma March 9, 2022 20:46
@zhaojuanmao zhaojuanmao left a comment

Overall looks great to me, just one minor comment. I will let @rohan-varma accept it.

fduwjj added a commit that referenced this pull request Mar 10, 2022
Pull Request resolved: #73405

Implement the `_clip_grad_norm_` for FSDP, issue: #72548
ghstack-source-id: 150990293

Differential Revision: [D34230605](https://our.internmc.facebook.com/intern/diff/D34230605/)
fduwjj commented Mar 10, 2022

Further addressed the reviewer's comment.

@fduwjj fduwjj requested a review from zhaojuanmao March 10, 2022 01:15
@rohan-varma rohan-varma left a comment

Awesome work, thanks for persisting through all of the issues and addressing all the comments! Looks great to ship.

fduwjj added a commit that referenced this pull request Mar 10, 2022
Pull Request resolved: #73405

Implement the `_clip_grad_norm_` for FSDP, issue: #72548
ghstack-source-id: 151059433

Differential Revision: [D34230605](https://our.internmc.facebook.com/intern/diff/D34230605/)
facebook-github-bot pushed a commit that referenced this pull request Mar 11, 2022
Summary:
Pull Request resolved: #73405

Implement the `_clip_grad_norm_` for FSDP, issue: #72548
ghstack-source-id: 151059433

Test Plan: CI

Reviewed By: rohan-varma

Differential Revision: D34230605

fbshipit-source-id: bbac7a6e49276e0f0502e2f4466c984aee2629fa
@github-actions
Hey @fduwjj.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

Labels

cla signed, module: fsdp, oncall: distributed (Add this issue/PR to distributed oncall triage queue), topic: new features (topic category)

5 participants