Conversation

@awgu (Collaborator) commented Nov 3, 2022

Stack from ghstack:

**BC Breaking Change**
This renames `unwrapped_params` to `nonwrapped_numel`. I prefer `nonwrapped` over `unwrapped` because "unwrap" suggests that some wrapping has been undone, and I prefer `numel` over `params` because it names the unit of measurement; "params" should keep referring to `nn.Parameter`s themselves.
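To make the naming distinction concrete, here is a small illustration (the module is arbitrary): "params" are the `nn.Parameter` objects themselves, while "numel" counts elements across them.
```
import torch.nn as nn

linear = nn.Linear(4, 8)
num_params = len(list(linear.parameters()))          # 2 nn.Parameters: weight and bias
numel = sum(p.numel() for p in linear.parameters())  # 4 * 8 + 8 = 40 elements
```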

This only breaks code that passes `unwrapped_params` as a keyword argument, and I did not find any code that does (except one internal benchmark file, which does not actually depend on our `pytorch` code).

In a follow-up, I want to rename `min_num_params` to `min_nonwrapped_numel` in `size_based_auto_wrap_policy`, which is also BC breaking. Again, this is to differentiate between "params" being `nn.Parameter`s and "numel" being the unit for `param.numel()`.
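For reference, this is how `size_based_auto_wrap_policy` is typically configured today with `min_num_params` (a sketch; the threshold value is arbitrary):
```
import functools

from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Wrap a subtree once its accumulated parameter numel exceeds the threshold
auto_wrap_policy = functools.partial(
    size_based_auto_wrap_policy,
    min_num_params=int(1e8),
)
```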

**Overview**
This PR introduces `ModuleWrapPolicy` as a lightweight layer over the existing `transformer_auto_wrap_policy`. The most common auto wrapping paradigm is:

```
module_classes: Set[Type[nn.Module]] = ...
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls=module_classes,
)
fsdp_model = FSDP(model, auto_wrap_policy=auto_wrap_policy, ...)
```

Now, users can instead write:

```
auto_wrap_policy = ModuleWrapPolicy(module_classes)
fsdp_model = FSDP(model, auto_wrap_policy=auto_wrap_policy, ...)
```

This hides the arguments expected of the policy callable (`recurse` and `unwrapped_params`/`nonwrapped_numel`) that users otherwise must handle themselves.
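For context, a raw auto wrap policy is a callable of roughly the following form, which a user otherwise has to write out by hand (a sketch; `MyBlock` is a hypothetical module class):
```
import torch.nn as nn

class MyBlock(nn.Module):  # hypothetical layer class to wrap
    ...

def my_auto_wrap_policy(
    module: nn.Module,
    recurse: bool,
    nonwrapped_numel: int,
) -> bool:
    if recurse:
        # Always continue traversing into children
        return True
    # Otherwise, decide whether to wrap this module
    return isinstance(module, MyBlock)
```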

`ModuleWrapPolicy` inherits from an abstract base class `FSDPPolicy` that expects a `policy` property. This decouples the construction of such `FSDPPolicy` classes from their actual `policy`, which must abide by the `_recursive_wrap` interface. Any existing auto wrap policy can be rewritten as a class that inherits from `FSDPPolicy`, so this approach is fully backward compatible from a functionality perspective.
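As a minimal sketch of that pattern (assuming the `FSDPPolicy` base class and `policy` property described above), `ModuleWrapPolicy` can simply delegate to the existing callable:
```
import abc
import functools
from typing import Callable, Iterable, Type

import torch.nn as nn
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class FSDPPolicy(abc.ABC):
    # Abstract base class described above (a sketch, not the exact definition)
    @property
    @abc.abstractmethod
    def policy(self) -> Callable:
        ...

class ModuleWrapPolicy(FSDPPolicy):
    def __init__(self, module_classes: Iterable[Type[nn.Module]]):
        # Pre-bind the module classes so the callable matches the
        # `_recursive_wrap` interface: (module, recurse, nonwrapped_numel) -> bool
        self._policy = functools.partial(
            transformer_auto_wrap_policy,
            transformer_layer_cls=set(module_classes),
        )

    @property
    def policy(self) -> Callable:
        return self._policy
```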

I call this base class `FSDPPolicy` to generalize over cases where we may not want to actually perform any nested wrapping. In reality, the policy is meant for constructing `FlatParameter`s, which previously just happened to be induced by nested wrapping. Given this, I am changing the constructor argument in `fully_shard()` to simply `policy` instead of `auto_wrap_policy`.
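For example, usage might look like the following (a sketch, assuming the composable `fully_shard` API and reusing `model` and `module_classes` from the snippets above):
```
from torch.distributed._composable import fully_shard

fully_shard(model, policy=ModuleWrapPolicy(module_classes))
```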

This PR migrates usages of `transformer_auto_wrap_policy` within our unit test suite to `ModuleWrapPolicy` as much as possible.

@pytorch-bot bot commented Nov 3, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88450

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cb44a77:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@awgu awgu added the topic: bc breaking and topic: improvements labels Nov 3, 2022
@awgu awgu requested a review from fegin November 3, 2022 21:12
awgu pushed a commit that referenced this pull request Nov 3, 2022
ghstack-source-id: 1c8e617
Pull Request resolved: #88450
Andrew Gu added 3 commits November 10, 2022 18:45
```
mixed_precision: Optional[MixedPrecision] = None,
cpu_offload: Optional[CPUOffload] = None,
auto_wrap_policy: Optional[Callable] = None,
policy: Optional[_FSDPPolicy] = None,
```
Contributor:

Let's still keep the `auto_wrap_policy` name for now? I feel `policy` is too general, and it is also a big BC change.

Collaborator Author:

This is composable FSDP. In my understanding, we should be able to change the constructor?

I wanted `policy` to be general because we can configure FSDP this way. This can be an entry point for different flavors of FSDP; one option to enable tensor shape preservation may be via `policy`.

Collaborator Author:

The wrapper `FullyShardedDataParallel` still calls it `auto_wrap_policy` in its constructor.

Contributor:

oh I see, that sounds good then.

@awgu awgu added the ciflow/trunk label Nov 11, 2022
@awgu (Collaborator Author) commented Nov 11, 2022

I checked internal code. There is no code passing `unwrapped_params` as a kwarg, so renaming it should not cause any breakage.

@awgu (Collaborator Author) commented Nov 11, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 additional jobs have failed, first few of them are: TorchBench CI (pytorch-linux-py3.8-cu116)


@awgu (Collaborator Author) commented Nov 11, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 additional jobs have failed, first few of them are: TorchBench CI (pytorch-linux-py3.8-cu116)


@awgu (Collaborator Author) commented Nov 11, 2022

[Screenshot: CI status as of Nov 11, 2022]

All tests are passing. Only TorchBench CI (pytorch-linux-py3.8-cu116) is skipped (not failing). However, merging is failing due to the skipped test.

@awgu (Collaborator Author) commented Nov 11, 2022

@pytorchbot merge -f "TorchBench CI (pytorch-linux-py3.8-cu116) is skipped but being incorrectly treated as failed"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).


@pytorchmergebot (Collaborator)

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x e410e8dcfe12b2a70f7143df05e110a37992046d returned non-zero exit code 1

```
Auto-merging test/distributed/fsdp/test_fsdp_state_dict.py
CONFLICT (content): Merge conflict in test/distributed/fsdp/test_fsdp_state_dict.py
Auto-merging torch/distributed/fsdp/_init_utils.py
Auto-merging torch/distributed/fsdp/fully_sharded_data_parallel.py
error: could not apply e410e8dcfe... [FSDP] Introduce `ModuleWrapPolicy` for simplicity
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
```

@awgu (Collaborator Author) commented Nov 12, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


awgu pushed a commit to awgu/pytorch that referenced this pull request Nov 12, 2022
ghstack-source-id: 3142eae
Pull Request resolved: pytorch#88450
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Pull Request resolved: pytorch#88450
Approved by: https://github.com/zhaojuanmao
@facebook-github-bot facebook-github-bot deleted the gh/awgu/192/head branch June 8, 2023 15:26

Labels

ciflow/trunk, Merged, release notes: distributed (fsdp), topic: bc breaking, topic: improvements
