[FSDP] Add use_orig_params
#84911
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84911
Note: Links to docs will display an error until the docs builds have been completed. ❗ There is 1 currently active SEV; if your PR is affected, please view it below. ✅ No failures as of commit fa871ec. This comment was automatically generated by Dr. CI and updates every 15 minutes.
rohan-varma
left a comment
Could we have a brief PR description for reviewers to have context?
**Overview**
This PR adds the option to use the original parameters via `use_orig_params=True` in the FSDP constructor.
- This exposes the original parameters rather than the `FlatParameter`s from `named_parameters()`, which means that the optimizer runs on the original parameters. Hence, users may assign original parameters from the same `FlatParameter` to different parameter groups.
- This enables decoupling the original parameter variables from their storage without changing the variables themselves, which is critical for our upcoming execution-order-based non-recursive wrapping policy.

For more detailed design explanation, refer to the Quip shared internally.

**Follow-Ups**
See 85831 (removing link to avoid spamming the issue whenever I update this PR).

`test_fsdp_use_orig_params.py` adds ~4 min 46 seconds to the TTS on the AWS cluster.
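For illustration, here is a minimal sketch of the usage this enables; `MyModel` is a placeholder, and the snippet assumes a process group has already been initialized and the module is on the right device:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Sketch: with use_orig_params=True, named_parameters() yields the original
# parameters (not FlatParameters), so parameters flattened into the same
# FlatParameter can still be placed into different optimizer param groups.
model = FSDP(MyModel().cuda(), use_orig_params=True)  # MyModel: placeholder

decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optim = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```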
/easycla As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details. This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.
def get_error_context():
    error_regex = "Optimizer state checkpointing is not supported yet for `use_orig_params=True`"
    return self.assertRaisesRegex(
        expected_exception=NotImplementedError, expected_regex=error_regex
Can we file issues for this and all other unsupported features?
with self.assertRaisesRegex(RuntimeError, "Cannot writeback"):
    # Change the gradient to a new one with 1 added to each dimension
    # to force a shape mismatch when writing back
    if self.rank == 0:
if/else can be condensed to:
param = getattr(fsdp, f"lin{rank}")
lin_weight_shape = param.weight.shape
param.weight = nn.Parameter(...)
param.weight.grad = ....
rohan-varma
left a comment
LGTM overall, thanks for working through this and the great attention to detail! I have two high-level questions:
- Shall we file follow-up issues for all unsupported features, such as optimizer state checkpointing?
- Did we update all necessary documentation mentioning how to use this feature and its caveats/assumptions (such as the gradient writeback)?
flat_param.grad = flat_param._saved_grad_shard  # type: ignore[attr-defined]
if self._config.keep_low_precision_grads:
    assert flat_param.grad is not None  # mypy
    flat_param.grad.data = flat_param.grad.to(self._config.param_dtype)
So mixed precision doesn't work with CPU offload? Can we file an issue for this? It seems pretty major.
self._use_orig_params
and self._handles
and self._handles[0].uses_sharded_strategy
and self._handles[0].is_sharded(self._handles[0].flat_param)
Is is_sharded the canonical, recommended way to check if a param is in the sharded state? How about gradients?
In general, is it worth exposing docs on such methods to aid FSDP developers in the future who are looking to do these common sorts of things?
Yes, is_sharded() can be the canonical way to check if a parameter or its gradient is currently sharded.
Do you have any suggestions for how to expose docs / what would help FSDP developers onboard more efficiently? The method is currently documented, but perhaps this is not salient enough.
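For reference, a minimal sketch of that check using the internal names from the condition quoted above (these are internal APIs and subject to change):

```python
# Sketch: mirror the quoted condition to check whether the (single) handle's
# FlatParameter is currently in its sharded form.
def flat_param_is_sharded(fsdp_state) -> bool:
    if not fsdp_state._handles:
        return False
    handle = fsdp_state._handles[0]
    return handle.uses_sharded_strategy and handle.is_sharded(handle.flat_param)
```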
def _sharded_post_load_state_dict_hook(self, *args, **kwargs) -> None:
    pass
    if self._use_orig_params:
        self._register_orig_params()
Is there guidance for FSDP developers on when they will need to call these methods?
I am still working to thoroughly understand how model state dict is implemented, namely the pre/postconditions of the pre/post save and load hooks, e.g. should FlatParameters be registered or should original parameters be registered, what do the prefixes and state dict keys look like at some point in the recursive call stack, etc.
I started trying to retire FlattenParamsWrapper but got quickly stymied by trying to understand those pre/postconditions. Maybe after I figure this out, I can help provide more internal documentation around these invariants.
if torch.cuda.is_available():
    torch.cuda.synchronize()
self._lazy_init()
self._clear_grads_if_needed()
Why do we need to do this for state_dict? Grads being None shouldn't matter there?
I am just using the major FSDP calls as an entry point to release gradient memory as early as possible. In the code crawl I did manually, I found that sometimes people will checkpoint after zero_grad(set_to_none=True) after the optimizer step.
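For context, a hedged sketch of the user pattern described above (generic placeholder names; the point is that state_dict() may run after gradients were already set to None):

```python
# Sketch of the training-loop pattern found in the code crawl: checkpointing
# right after zero_grad(set_to_none=True), which is why state_dict() is also
# used as an entry point to release gradient memory early.
loss = model(batch).sum()                  # model/batch: placeholders
loss.backward()
optim.step()
optim.zero_grad(set_to_none=True)          # original params' .grad become None
torch.save(model.state_dict(), "ckpt.pt")  # checkpoint after grads were cleared
```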
return args, kwargs
self._wait_for_previous_optim_step()
self._needs_pre_forward_unshard.clear()
self._clear_grads_if_needed()
Add a comment to mention that we call this to maintain correctness when the user has called zero_grad(set_to_none=True); it acts as a sort of delayed set_to_none.
Also, is it worth documenting these writeback semantics clearly for the end user?
in_summon_full_params = self.training_state == TrainingState_.SUMMON_FULL_PARAMS
should_clean_name = (
    self.training_state == TrainingState_.SUMMON_FULL_PARAMS
    or self._use_orig_params
why do we need to clean the name when using use_orig_params? Shouldn't the param FQNs be exactly the local param names?
There is still nested wrapping (FSDP -> FPW -> module), so I think the names will be unclean.
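For illustration, a hedged sketch of what the cleaning does; the `_fsdp_wrapped_module.` and `_fpw_module.` prefixes are the wrapper attribute names assumed here:

```python
# Sketch: with nested wrapping (FSDP -> FlattenParamsWrapper -> module), raw
# FQNs carry wrapper prefixes; cleaning strips them so that use_orig_params
# users see the original parameter names.
FSDP_PREFIX = "_fsdp_wrapped_module."  # assumed wrapper attribute names
FPW_PREFIX = "_fpw_module."

def clean_fqn(fqn: str) -> str:
    return fqn.replace(FSDP_PREFIX, "").replace(FPW_PREFIX, "")

print(clean_fqn("_fsdp_wrapped_module._fpw_module.lin.weight"))  # lin.weight
```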
for fsdp_module in FullyShardedDataParallel.fsdp_modules(model)
):
    raise NotImplementedError(
        "Optimizer state checkpointing is not supported yet for `use_orig_params=True`"
have we filed issues for this?
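For context, a hedged sketch of how a user would currently hit this guard; `model` and `optim` are placeholders, and it assumes the optimizer state-dict APIs route through the check above:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Sketch: with use_orig_params=True anywhere in the wrapped module tree,
# optimizer state checkpointing is expected to raise NotImplementedError.
try:
    osd = FSDP.full_optim_state_dict(model, optim)  # model/optim: placeholders
except NotImplementedError as e:
    print(e)  # "Optimizer state checkpointing is not supported yet for `use_orig_params=True`"
```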
param = self.flat_param._params[i]  # type: ignore[index]
setattr(module, param_name, param)
param.data = view
I feel we need a comment here. I read this for a while, and it seems that it intentionally exposes the 'param' variable as the module's attribute so that its .data can be changed to point to new data later on? Also, param_name is not registered as a parameter here; why?
Correct. More precisely, we never delete the original parameter variable; instead, FlatParamHandle always keeps a reference to the original parameter variable.
Just for knowledge sharing, de-registration can happen in two ways:
1. delattr(module, param_name), where the parameter is stored as module.param_name
2. module._parameters.pop(param_name)

The second way preserves that the parameter is present, i.e. the user may still access module.param_name; however, the parameter will not be returned by named_parameters().
Similarly, registration can happen in two ways:
1. setattr(module, param_name, param)
2. module._parameters[param_name] = param

Since we already setattr(), we do not need to do any further explicit registration.
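To make the explanation above concrete, a small standalone sketch of the two (de-)registration mechanisms on a plain nn.Module (hypothetical module, for illustration only):

```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 4)

# De-registration, way 1: delattr() removes the parameter entirely
# (nn.Module.__delattr__ drops it from lin._parameters).
delattr(lin, "bias")
assert not hasattr(lin, "bias")

# De-registration, way 2: pop from _parameters while keeping our own reference;
# the parameter no longer appears in named_parameters().
weight = lin._parameters.pop("weight")
assert "weight" not in dict(lin.named_parameters())

# Registration, way 1: setattr() with an nn.Parameter; nn.Module.__setattr__
# routes this through register_parameter(), so nothing further is needed.
lin.weight = weight

# Registration, way 2: write into _parameters directly.
lin._parameters["bias"] = nn.Parameter(torch.zeros(4))
```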
oh I see, thanks for the clarification!
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hey @awgu.
zhaojuanmao
left a comment
Awesome PR! It handles so many subtle cases properly and carefully. I especially like the idea of keeping flat_param as the module's attribute while dynamically registering it as a parameter; it seems that this idea simplified the state_dict changes a lot.
| """ | ||
| if not self._handles: | ||
| return | ||
| handle = self._handles[0] |
are we assuming there is only one flat_param_handle per '_fsdp_wrapped_module' for now?
Yes, I have an assert a few lines above 😄
pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py
Lines 3118 to 3122 in fa871ec
p_assert(
    len(self._handles) <= 1,
    "Expects <=1 handle per FSDP instance; needs to be refactored "
    "for >1 handle (e.g. non-recursive wrapping)"
)
Summary:
**Overview**
This PR adds the option to use the original parameters via `use_orig_params=True` in the FSDP constructor.
- This exposes the original parameters rather than the `FlatParameter`s from `named_parameters()`, which means that the optimizer runs on the original parameters. Hence, users may assign original parameters from the same `FlatParameter` to different parameter groups.
- This enables decoupling the original parameter variables from their storage without changing the variables themselves, which is critical for our upcoming execution-order-based non-recursive wrapping policy.

For more detailed design explanation, refer to the Quip shared internally.

**Follow-Ups**
See 85831 (removing link to avoid spamming the issue whenever I update this PR).

`test_fsdp_use_orig_params.py` adds ~4 min 46 seconds to the TTS on the AWS cluster.

Pull Request resolved: #84911
Approved by: https://github.com/rohan-varma
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/be682befbc836a07d5d070bb569450429526a64b
Reviewed By: seemethere
Differential Revision: D40197130
Pulled By: seemethere
fbshipit-source-id: fbf36e28bd06f49c8cb31febce86c26bc7ba7a34
ghstack-source-id: 5ee0687 Pull Request resolved: pytorch/pytorch#84911
ghstack-source-id: e936ff5 Pull Request resolved: pytorch/pytorch#84911
Stack from ghstack:
- #86122 [FSDP][2/N] Remove `_fsdp_wrapped_module.flat_param`
- #86117 [FSDP][1/N] Retire `FlattenParamsWrapper`
- #85738 [FSDP] Add initial `summon_full_params(with_grads=True)`
- #84911 [FSDP] Add `use_orig_params`

Overview
This PR adds the option to use the original parameters via `use_orig_params=True` in the FSDP constructor.
- This exposes the original parameters rather than the `FlatParameter`s from `named_parameters()`, which means that the optimizer runs on the original parameters. Hence, users may assign original parameters from the same `FlatParameter` to different parameter groups.
- This enables decoupling the original parameter variables from their storage without changing the variables themselves, which is critical for our upcoming execution-order-based non-recursive wrapping policy.

For more detailed design explanation, refer to the Quip shared internally.

Follow-Ups
See 85831 (removing link to avoid spamming the issue whenever I update this PR).

`test_fsdp_use_orig_params.py` adds ~4 min 46 seconds to the TTS on the AWS cluster.