[FSDP] summon offload to CPU #73904

rohan-varma · 2022-03-08T05:19:08Z

Stack from ghstack (oldest at bottom):

Implement ability to offload full params to CPU in summon_full_params.

Differential Revision: D34707801


pytorch-bot · 2022-03-08T05:19:12Z

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/354e5581e52d5d648a76dbaac2b29702daf7407c/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows	Labels (bold enabled)	Status
Triggered Workflows
linux-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
linux-binary-libtorch-cxx11-abi	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
linux-binary-libtorch-pre-cxx11	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
linux-binary-manywheel	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`, `ciflow/trunk`	✅ triggered
linux-bionic-py3.7-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/noarch`, `ciflow/trunk`	✅ triggered
linux-bionic-rocm4.5-py3.7	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/rocm`, `ciflow/trunk`	✅ triggered
linux-docs	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/docs`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-vulkan-bionic-py3.7-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`, `ciflow/vulkan`	✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test	`ciflow/all`, `ciflow/bazel`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3-clang5-mobile-build	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk`	✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-clang7-asan	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/sanitizers`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-clang7-onnx	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/onnx`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc7	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc7-no-ops	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
macos-arm64-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
macos-arm64-binary-wheel	`ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`	✅ triggered
macos-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
macos-binary-libtorch-cxx11-abi	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
macos-binary-libtorch-pre-cxx11	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
macos-binary-wheel	`ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
win-vs2019-cpu-py3	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/trunk`, `ciflow/win`	✅ triggered
win-vs2019-cuda11.3-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/trunk`, `ciflow/win`	✅ triggered
windows-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
windows-binary-libtorch-debug	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
windows-binary-libtorch-release	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
windows-binary-wheel	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`, `ciflow/trunk`	✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
docker-builds	`ciflow/all`, `ciflow/trunk`	🚫 skipped
ios-12-5-1-arm64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-custom-ops	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-metal	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-x86-64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
ios-12-5-1-x86-64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/slow`, `ciflow/trunk`	🚫 skipped
linux-docs-push	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
macos-10-15-py3-arm64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
macos-11-py3-x86-64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`, `ciflow/slow`, `ciflow/slow-gradcheck`	🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-win-vs2019-cuda11.5-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/scheduled`, `ciflow/win`	🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`, `ciflow/xla`	🚫 skipped

facebook-github-bot · 2022-03-08T05:19:18Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/73904
📄 Preview docs built from this PR
📄 Preview C++ docs built from this PR
🔧 Opt-in to CIFlow to control what jobs run on your PRs

💊 CI failures summary and remediations

As of commit b21c4b2 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚



fegin

LGTM, with one comment about warning on the combination of rank0_only and offload_to_cpu.

fegin · 2022-03-08T17:26:05Z

torch/distributed/fsdp/fully_sharded_data_parallel.py

+            offload_to_cpu (bool, optional): If ``True``, full parameters are
+                offloaded to CPU. Note that this offloading currently only
+                occurs if the parameter is sharded (which is only not the case
+                for world_size = 1).


In some rare cases, users may actually want multiple CPU copies, but we should add a note, warning, or check that alerts users when offload_to_cpu is True but rank0_only is False. At the very least, users should be fully aware of the potential CPU OOM issue.

sounds good, added a warning
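
For illustration, the kind of check discussed above could look like the sketch below. The helper name `check_summon_args` and the warning text are assumptions made for this example; the thread does not show the warning that was actually added to the PR.

```python
import warnings


def check_summon_args(rank0_only: bool, offload_to_cpu: bool) -> None:
    """Hypothetical helper: warn when every rank would hold a full CPU copy."""
    if offload_to_cpu and not rank0_only:
        # Each rank materializes the full parameters on CPU, so ranks that
        # share a host keep redundant copies in the same CPU memory.
        warnings.warn(
            "offload_to_cpu=True with rank0_only=False offloads the full "
            "parameters to CPU on every rank, which may cause CPU OOM when "
            "several ranks share a host. Consider rank0_only=True.",
            UserWarning,
            stacklevel=2,
        )
```

Calling `check_summon_args(rank0_only=False, offload_to_cpu=True)` emits one `UserWarning`; the other argument combinations stay silent. A warning rather than an error keeps the rare "multiple CPU copies on purpose" case usable.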


rohan-varma · 2022-03-09T20:21:36Z

torch/distributed/fsdp/fully_sharded_data_parallel.py

+                            )
+                            self._update_p_data(
+                                p, output_tensor=p._full_param_padded,  # type: ignore[attr-defined]
+                            )


@zhaojuanmao, I just want to double-check this offloading approach, since it differs a bit from regular parameter offload.

Here _full_param_padded is already materialized, and p.data points to it. So to offload, we:

1. Offload _full_param_padded to CPU.
2. Update p.data to point to the offloaded _full_param_padded.

We then restore appropriately when exiting the context.

If we offloaded p.data directly, the GPU memory would still be held, because _full_param_padded would keep referencing it.

yeah, this looks good to me!
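
The aliasing argument above can be illustrated with a toy model in plain Python. This is only a sketch: `Buf`, `Param`, and `offloaded_to_cpu` are hypothetical stand-ins invented for this example (not FSDP code), devices are just string tags, and the restore step (copying back to the original device) is an assumption about what "restore appropriately" means.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Buf:
    """Toy stand-in for a tensor: a device tag plus its values."""
    device: str
    values: List[float]

    def to(self, device: str) -> "Buf":
        # Like Tensor.to() across devices: returns a new copy.
        return Buf(device, list(self.values))


class Param:
    """Toy stand-in for a parameter whose full, padded version is materialized."""

    def __init__(self, values: List[float]) -> None:
        self._full_param_padded = Buf("cuda", values)
        self.data = self._full_param_padded  # p.data aliases the full param


@contextmanager
def offloaded_to_cpu(p: Param) -> Iterator[Param]:
    # 1. Offload the padded full parameter itself.
    p._full_param_padded = p._full_param_padded.to("cpu")
    # 2. Re-point p.data at the offloaded buffer; only now is the last
    #    reference to the original "cuda" buffer dropped.
    p.data = p._full_param_padded
    try:
        yield p
    finally:
        # Assumed restore step: move the buffer back and re-alias p.data.
        p._full_param_padded = p._full_param_padded.to("cuda")
        p.data = p._full_param_padded
```

Inside `with offloaded_to_cpu(p):`, both `p.data` and `p._full_param_padded` refer to the same "cpu" buffer, and nothing references the original "cuda" buffer any more. Running only `p.data = p.data.to("cpu")` would leave `p._full_param_padded` pointing at the "cuda" buffer, which is the situation the comment warns about.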

zhaojuanmao

left some minor comments, looks great!

In terms of enabling these configs in the state_dict() APIs, could we make rank0_only the default setting and CPU offloading optional?

zhaojuanmao · 2022-03-09T19:51:52Z

torch/distributed/fsdp/fully_sharded_data_parallel.py

+                                        )
+
                        if writeback:
                            self._write_back_current_shard()


I just realized that when CPU offloading is enabled in the main FSDP API and writeback=True, the parameters end up pointing to the local shards on GPU rather than the local shards on CPU. That looks like a bug. Maybe we can restrict writeback to the non-CPU-offloading case only?

I'll look into this and file an issue.


Summary:
Pull Request resolved: #73904
Implement ability to offload full params to CPU in summon_full_params.
ghstack-source-id: 151311236
Test Plan: CI
Reviewed By: fegin
Differential Revision: D34707801
fbshipit-source-id: efc9c568037ddfeb9ddaff45a1a0388c6bc85825

github-actions · 2022-03-15T04:58:43Z

Hey @rohan-varma.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

[FSDP] summon offload to CPU

354e558


rohan-varma requested review from H-Huang, mingzhe09088, mrshenli, pritamdamania87 and zhaojuanmao as code owners March 8, 2022 05:19

pytorch-bot bot added the ciflow/default label Mar 8, 2022

facebook-github-bot added the cla signed label Mar 8, 2022

facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Mar 8, 2022

rohan-varma mentioned this pull request Mar 8, 2022

[FSDP] Option to summon on rank 0 only #73903

Closed

Update on "[FSDP] summon offload to CPU"

2c5cc92


fegin approved these changes Mar 8, 2022


Update on "[FSDP] summon offload to CPU"

8c9ec7d


rohan-varma commented Mar 9, 2022


zhaojuanmao reviewed Mar 9, 2022


fegin approved these changes Mar 10, 2022


rohan-varma mentioned this pull request Mar 14, 2022

[FSDP] Investigate summon_full_params with CPU offloading #74166

Closed

Update on "[FSDP] summon offload to CPU"

b21c4b2


pytorchmergebot closed this in c2aba0f Mar 15, 2022

facebook-github-bot deleted the gh/rohan-varma/519/head branch March 18, 2022 14:17

WBobby mentioned this pull request Aug 17, 2022

Add ROCm5.2.3/AMDGPU support for PyTorch WBobby/pytorch#2

Closed
