[FSDP] Fix summon_full_params test #74456

rohan-varma · 2022-03-21T07:17:24Z

This test did not actually do any CPU offloading. Adding it revealed
an issue in the no_shard case (currently only when world_size == 1),
which is tracked for a fix in #74166.

Differential Revision: D35003793

pytorch-bot · 2022-03-21T07:17:27Z

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/a71f967a76018e25f1ec23924b36651a10d9b637/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows	Labels (bold enabled)	Status
Triggered Workflows
deploy-linux-xenial-cuda11.3-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
linux-binary-libtorch-cxx11-abi	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
linux-binary-libtorch-pre-cxx11	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
linux-binary-manywheel	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`, `ciflow/trunk`	✅ triggered
linux-bionic-py3.7-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/noarch`, `ciflow/trunk`	✅ triggered
linux-bionic-rocm4.5-py3.7	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/rocm`, `ciflow/trunk`	✅ triggered
linux-docs	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/docs`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-vulkan-bionic-py3.7-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`, `ciflow/vulkan`	✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test	`ciflow/all`, `ciflow/bazel`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3-clang5-mobile-build	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk`	✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-clang7-asan	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/sanitizers`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-clang7-onnx	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/onnx`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc7	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc7-no-ops	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
macos-arm64-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
macos-arm64-binary-wheel	`ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`	✅ triggered
macos-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
macos-binary-libtorch-cxx11-abi	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
macos-binary-libtorch-pre-cxx11	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
macos-binary-wheel	`ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
win-vs2019-cpu-py3	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/trunk`, `ciflow/win`	✅ triggered
win-vs2019-cuda11.3-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/trunk`, `ciflow/win`	✅ triggered
windows-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
windows-binary-libtorch-debug	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
windows-binary-libtorch-release	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
windows-binary-wheel	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`, `ciflow/trunk`	✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
docker-builds	`ciflow/all`, `ciflow/trunk`	🚫 skipped
ios-12-5-1-arm64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-custom-ops	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-metal	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-x86-64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
ios-12-5-1-x86-64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/slow`, `ciflow/trunk`	🚫 skipped
linux-bionic-rocm4.5-py3.7-distributed	`ciflow/all`, `ciflow/linux`, `ciflow/rocm`, `ciflow/trunk`	🚫 skipped
linux-docs-push	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
macos-10-15-py3-arm64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
macos-11-py3-x86-64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`, `ciflow/slow`, `ciflow/slow-gradcheck`	🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-win-vs2019-cuda11.5-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/scheduled`, `ciflow/win`	🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`, `ciflow/xla`	🚫 skipped

facebook-github-bot · 2022-03-21T07:17:29Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/74456
Need help or want to give feedback on the CI? Visit our office hours

💊 CI failures summary and remediations

As of commit e86140c (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI.

mrshenli · 2022-03-22T14:40:36Z

test/distributed/fsdp/test_fsdp_summon_full_params.py

+    lin1 = FSDP(nn.Linear(5, 5, bias=False).cuda(cls.rank), cpu_offload=cpu_offload)
+    lin2 = nn.Linear(5, 3, bias=False).cuda(cls.rank)
+    model = FSDP(nn.Sequential(lin1, lin2), cpu_offload=cpu_offload)


Curious: what is the expectation on the input model when cpu_offload==True? Does it have to reside on GPU, or is it OK to pass in a CPU model?

rohan-varma · 2022-03-23

Passing CPU offload with a CPU model should work: we don't assume any device in the constructor, and the parameters would be restored to the GPU in the forward pass before all_gather. That said, we don't seem to have a test for this, so we should add one.
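A test for the CPU-model case discussed here might look roughly like the sketch below. It assumes the public `torch.distributed.fsdp` API (`FullyShardedDataParallel`, `CPUOffload`), a process group that has already been initialized, and one visible GPU per rank. The helper name `check_cpu_model_with_offload` and the shape check are illustrative assumptions, not code from this PR.

```python
def check_cpu_model_with_offload(rank: int) -> None:
    """Wrap a module that never left the CPU and run one forward pass.

    Hypothetical helper: assumes torch.distributed is already initialized
    and that `rank` indexes a visible CUDA device.
    """
    # Imported lazily so this sketch can be loaded on a machine without PyTorch.
    import torch
    import torch.nn as nn
    from torch.distributed.fsdp import CPUOffload
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    torch.cuda.set_device(rank)

    # Deliberately left on the CPU: the constructor is not expected to
    # assume any particular device for the wrapped module.
    cpu_module = nn.Linear(5, 5, bias=False)
    model = FSDP(cpu_module, cpu_offload=CPUOffload(offload_params=True))

    # The forward pass is where the parameters should be moved to the GPU,
    # so the input is placed on this rank's device.
    out = model(torch.randn(2, 5, device=f"cuda:{rank}"))
    assert out.shape == (2, 5)
```

Such a helper would be called once per rank from a multi-process test harness; it only checks that construction and a forward pass succeed, not where the parameters end up afterwards.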

mrshenli · 2022-03-22T14:41:23Z

test/distributed/fsdp/test_fsdp_summon_full_params.py

+    lin2 = nn.Linear(5, 3, bias=False).cuda(cls.rank)
+    model = FSDP(nn.Sequential(lin1, lin2), cpu_offload=cpu_offload)
+    if not cpu_offload:
+        model = model.cuda(cls.rank)


Why do we need this? I assume that when cpu_offload==False, the model is already on CUDA, no?

rohan-varma · 2022-03-23

Right, we can remove this because lin1 and lin2 are already on the CUDA device.
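With the redundant move dropped, the model construction from the diff could be sketched as follows. This assumes `cpu_offload` is a `CPUOffload` config from the public FSDP API and that a process group is already initialized; the function name `build_model` and its `rank` and `offload_params` parameters are hypothetical stand-ins for the test's `cls.rank` and parametrization.

```python
def build_model(rank: int, offload_params: bool):
    """Build the nested FSDP model from the diff without the extra .cuda() call.

    Hypothetical helper: assumes torch.distributed is already initialized
    and that `rank` indexes a visible CUDA device.
    """
    # Imported lazily so this sketch can be loaded on a machine without PyTorch.
    import torch.nn as nn
    from torch.distributed.fsdp import CPUOffload
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    cpu_offload = CPUOffload(offload_params=offload_params)

    # Both linear layers are placed on this rank's GPU before wrapping...
    lin1 = FSDP(nn.Linear(5, 5, bias=False).cuda(rank), cpu_offload=cpu_offload)
    lin2 = nn.Linear(5, 3, bias=False).cuda(rank)

    # ...so the outer wrapper needs no follow-up model.cuda(rank).
    return FSDP(nn.Sequential(lin1, lin2), cpu_offload=cpu_offload)
```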

Summary:
Pull Request resolved: #74456
This test did not actually do any CPU offloading. Adding it revealed an issue in no_shard (currently only when world_size == 1) case which is tracked for a fix in #74166
ghstack-source-id: 152057449
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D35003793
fbshipit-source-id: ffb5f8ea897cda84584ae83d362bea3c0c407c3b

github-actions · 2022-03-24T04:57:49Z

Hey @rohan-varma.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

Summary:
Pull Request resolved: #74456
This test did not actually do any CPU offloading. Adding it revealed an issue in no_shard (currently only when world_size == 1) case which is tracked for a fix in #74166
ghstack-source-id: 152057449
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D35003793
fbshipit-source-id: ffb5f8ea897cda84584ae83d362bea3c0c407c3b
(cherry picked from commit 06afeba)

rohan-varma requested review from H-Huang, mrshenli, pritamdamania87 and zhaojuanmao as code owners March 21, 2022 07:17

pytorch-bot bot added the ciflow/default label Mar 21, 2022

facebook-github-bot added the cla signed label Mar 21, 2022

facebook-github-bot added the oncall: distributed label Mar 21, 2022

rohan-varma mentioned this pull request Mar 21, 2022

[FSDP] Investigate summon_full_params with CPU offloading #74166

Closed

rohan-varma mentioned this pull request Mar 21, 2022

[FSDP] Mixed precision enablement #74452

Closed

18 tasks

zhaojuanmao approved these changes Mar 21, 2022

rohan-varma added 2 commits March 21, 2022 07:04

rohan-varma mentioned this pull request Mar 22, 2022

[FSDP] named_buffers fix #74517

Closed

mrshenli reviewed Mar 22, 2022

suo removed the ciflow/default label Mar 22, 2022

pytorchmergebot closed this in a7b6b1f Mar 24, 2022

facebook-github-bot deleted the gh/rohan-varma/526/head branch March 27, 2022 14:17

WBobby mentioned this pull request Aug 17, 2022

Add ROCm5.2.3/AMDGPU support for PyTorch WBobby/pytorch#2

Closed

6 participants
