[FSDP] Add initial summon_full_params(with_grads=True)
#85738
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85738. Note: links to docs will display an error until the docs builds have been completed.
✅ No failures, 1 pending as of commit 8122cbd. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
This adds `summon_full_params(with_grads=True)` for `use_orig_params=True` and `offload_to_cpu=False`. Filling in the `use_orig_params=False` case requires some already-planned refactoring, and the `offload_to_cpu=True` case needs some additional work as well.
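A hedged usage sketch of what the new flag enables, assuming a typical single-node FSDP setup with `use_orig_params=True` (the only case this PR covers); the toy model, shapes, and launch details are illustrative, not part of the PR:

```python
# Illustrative only: assumes torch.distributed is already initialized
# (e.g., launched via torchrun) and that CUDA is available.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(nn.Linear(5, 5).cuda(), use_orig_params=True)
loss = model(torch.randn(2, 5, device="cuda")).sum()
loss.backward()

# With with_grads=True, each rank sees the full (unsharded) parameters *and*
# their gradients, which helps verify that gradients are updated correctly.
with FSDP.summon_full_params(model, with_grads=True):
    for name, param in model.named_parameters():
        grad_shape = None if param.grad is None else param.grad.shape
        print(name, tuple(param.shape), grad_shape)
```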
rohan-varma
left a comment
LGTM overall, some mostly minor questions / comments.
self._check_sharded(flat_param.grad)
flat_param._saved_grad_shard = flat_param.grad  # type: ignore[attr-defined]
sharded_grad = flat_param._saved_grad_shard  # type: ignore[attr-defined]
dist._all_gather_base(padded_unsharded_grad, sharded_grad, self.process_group)
Does it mean that the gradient is all zeros if `flat_param.grad = None` on all ranks?
We discussed briefly offline, but there are two options:

1. (As in the PR currently) We only use a single all-gather collective per `FlatParameter`. In that case, if all ranks' sharded gradient is `None`, then the unsharded gradient is incorrectly `torch.zeros(unsharded_size)`. If only some ranks' sharded gradients are `None`, then the unsharded gradient zeros those corresponding elements.
2. We use a preceding all-reduce collective per `FlatParameter` to indicate whether each rank's sharded gradient is `None` or not. This solves the problem from 1.

Since `summon_full_params(with_grads=True)` is meant for debugging, I can see the argument for pursuing 2. I can change this in a follow-up PR and add/adjust unit tests accordingly.
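For reference, a minimal sketch of option 2, assuming a standard `torch.distributed` setup; `flat_param`, `padded_unsharded_grad`, and `process_group` stand in for the FSDP internals shown in the diff, and the helper name is hypothetical:

```python
import torch
import torch.distributed as dist

def _all_gather_grad_or_none(flat_param, padded_unsharded_grad, process_group):
    # Preceding all-reduce: learn whether *any* rank has a real sharded gradient.
    has_grad = torch.tensor(
        [flat_param.grad is not None], dtype=torch.int32, device=flat_param.device
    )
    dist.all_reduce(has_grad, op=dist.ReduceOp.MAX, group=process_group)
    if has_grad.item() == 0:
        # All ranks had grad=None, so report "no gradient" instead of zeros.
        return None
    # Ranks without a gradient contribute zeros so the all-gather is well defined.
    sharded_grad = (
        flat_param.grad if flat_param.grad is not None else torch.zeros_like(flat_param)
    )
    dist._all_gather_base(padded_unsharded_grad, sharded_grad, process_group)
    return padded_unsharded_grad
```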
)
param = getattr(module, param_name)
param.grad = view
for i, (
Is this for shared parameters? Could you give an example of what this changes when there are shared params?
This just makes sure that each shared parameter's `.grad` is also populated. Without this, if we had a model like

lin = nn.Linear(5, 5)
lin.weight = lin.bias

then only one of (`lin.weight`, `lin.bias`) would get a `.grad`, since only one of them is accounted for in `_param_infos` / `_params` / etc.
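A minimal sketch of that idea, with a hypothetical `_shared_param_infos` layout (the tuple fields and helper name are illustrative, not FSDP's exact metadata):

```python
def _populate_shared_param_grads(flat_param):
    # Each entry points a "shared" parameter back at the primary parameter
    # whose .grad view was already assigned from the unsharded gradient.
    for param_name, module, prim_param_name, prim_module in flat_param._shared_param_infos:
        prim_param = getattr(prim_module, prim_param_name)
        shared_param = getattr(module, param_name)
        # Reuse the same gradient view so that, e.g., lin.bias.grad is
        # populated even though only lin.weight appears in _param_infos.
        shared_param.grad = prim_param.grad
```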
@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
zhaojuanmao
left a comment
Thanks a lot for adding this! It will improve debuggability a lot for users under `summon_full_params` mode.
sharded_grad = torch.zeros_like(flat_param)  # type: ignore[attr-defined]
else:
self._check_sharded(flat_param.grad)
flat_param._saved_grad_shard = flat_param.grad  # type: ignore[attr-defined]
Curious why we intentionally fill `flat_param._saved_grad_shard` here?
We already use `_saved_grad_shard` for saving the sharded gradient during the backward pass. Here, I use the same variable for a different purpose: saving the sharded gradient during `summon_full_params(with_grads=True)`. This is just to avoid creating an entirely new variable that also saves the sharded gradient.
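A minimal sketch of the stash/restore pattern described here, using a hypothetical context-manager helper and simplified `FlatParameter` handling (not the actual FSDP code path):

```python
import contextlib

@contextlib.contextmanager
def _expose_unsharded_grad(flat_param, unsharded_grad):
    # Stash the sharded gradient in the same slot the backward pass uses.
    flat_param._saved_grad_shard = flat_param.grad
    # Expose the full (unsharded) gradient for the duration of the context.
    # (Assumes flat_param.data has already been swapped to its unsharded
    # storage so the gradient shape lines up.)
    flat_param.grad = unsharded_grad
    try:
        yield
    finally:
        # Restore the sharded gradient on exit and drop the stash.
        flat_param.grad = flat_param._saved_grad_shard
        del flat_param._saved_grad_shard
```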
[FSDP] Add initial `summon_full_params(with_grads=True)` (#85738)

Summary: This adds `summon_full_params(with_grads=True)` for `use_orig_params=True` and `offload_to_cpu=False`. Filling in the `use_orig_params=False` case requires some already-planned refactoring, and the `offload_to_cpu=True` case needs some additional work as well. Adding this is helpful for debugging `use_orig_params=True` to make sure gradients are being updated correctly.

Pull Request resolved: #85738
Approved by: https://github.com/rohan-varma
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/a95889ba7c1ecd8cb0f90507a6152cb035bcefd1
Reviewed By: seemethere
Differential Revision: D40197192
Pulled By: seemethere
fbshipit-source-id: 742ea641d7f005946e0714181c0a91167fe9fb9d
Stack from ghstack:

- #86122 [FSDP][2/N] Remove `_fsdp_wrapped_module.flat_param`
- #86117 [FSDP][1/N] Retire `FlattenParamsWrapper`
- #85738 [FSDP] Add initial `summon_full_params(with_grads=True)`
- #84911 [FSDP] Add `use_orig_params`

This adds `summon_full_params(with_grads=True)` for `use_orig_params=True` and `offload_to_cpu=False`. Filling in the `use_orig_params=False` case requires some already-planned refactoring, and the `offload_to_cpu=True` case needs some additional work as well.

Adding this is helpful for debugging `use_orig_params=True` to make sure gradients are being updated correctly.