[FSDP] Clean up FlatParamHandle dtypes, post-backward hook
#90660
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90660. Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit e9aa3e0. This comment was automatically generated by Dr. CI and updates every 15 minutes.
zhaojuanmao left a comment:
Thanks!!
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Closes #90838. To make mixed precision precise internally, #90660 changed the implementation to save `_orig_param_dtype`, `_low_prec_param_dtype`, and `_reduce_dtype` explicitly. However, these are computed at FSDP construction time, so it does not allow the user to change the model dtype after FSDP construction time but before lazy initialization. This PR recomputes those dtype attributes as needed if the model dtype changes in that window. Note that any mixed precision settings specified by the user take precedence over the model dtype.

Pull Request resolved: #91192
Approved by: https://github.com/zhaojuanmao
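As a rough usage sketch of the window this follow-up addresses (assuming a distributed process group has already been initialized and a CUDA device is available; the module and shapes are illustrative):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = FSDP(torch.nn.Linear(8, 8).cuda())

# Change the model dtype after FSDP construction but before the first forward
# pass (i.e. before lazy initialization). The follow-up recomputes the dtype
# attributes in this window instead of keeping stale float32 values computed
# at construction time.
model = model.half()

out = model(torch.randn(4, 8, device="cuda", dtype=torch.float16))
```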
Stack from ghstack:
- #90660 [FSDP] Clean up `FlatParamHandle` dtypes, post-backward hook (this PR)
- #90615 [FSDP] Tighten post-bwd cast to `reduce_dtype`
- #90622 [FSDP][Easy] Move to `_storage()` in test file
- #90611 [FSDP] Save `_stream_to_name` for debugging
- #90562 [Reland][FSDP] Another fix for `DTensor`, `use_orig_params=True`

This PR reworks the internal handling of parameter and gradient reduction mixed precision, cleans up the post-backward hook logic, and makes some minor changes to the communication hooks.
Overview
This PR addresses everything in #90657 except renaming `keep_low_precision_grads` to `keep_grads_in_reduce_dtype`, since that is BC breaking. I recommend reading the issue before proceeding.

For `MixedPrecision(param_dtype, reduce_dtype, ...)`, the rule for parameter and gradient reduction mixed precision is: parameters are cast to `param_dtype` for forward/backward when it is specified, and gradients are reduced in `reduce_dtype` when it is specified, falling back to `param_dtype` if only that is specified, and otherwise to the original parameter dtype.

This PR enforces that, at the `FlatParamHandle` level, `handle._config.fwd_bwd_param_dtype` and `handle._config.reduce_dtype` are never `None`. The way to check whether mixed precision is enabled is to compare against the original parameter dtype, which is now stored in `handle._orig_param_dtype`; the check is no longer against `None`.

This avoids ambiguous cases such as when the user passes `MixedPrecision(param_dtype=torch.float32)`. In that case, our existing implementation mistakenly thinks that parameter mixed precision is enabled and either silently relies on no-ops or errors (such as one case reported by MosaicML).
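A minimal sketch of that resolution rule (`resolve_dtypes` is a hypothetical standalone helper for illustration, not FSDP's internal code):

```python
import torch
from torch.distributed.fsdp import MixedPrecision

def resolve_dtypes(mp: MixedPrecision, orig_param_dtype: torch.dtype):
    # Hypothetical helper mirroring the rule above: the handle-level
    # dtypes are never None.
    low_prec_param_dtype = (
        mp.param_dtype if mp.param_dtype is not None else orig_param_dtype
    )
    # Gradient reduction falls back to the (possibly low-precision) parameter
    # dtype, and otherwise to the original parameter dtype.
    reduce_dtype = (
        mp.reduce_dtype if mp.reduce_dtype is not None else low_prec_param_dtype
    )
    return low_prec_param_dtype, reduce_dtype

# Ambiguous case from above: param_dtype=torch.float32 on a float32 model
# should mean "parameter mixed precision is NOT enabled".
low, red = resolve_dtypes(MixedPrecision(param_dtype=torch.float32), torch.float32)
param_mp_enabled = low != torch.float32  # compare against the original dtype, not None
print(param_mp_enabled)  # False
```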
Additional Details

- Removes `FullyShardedDataParallel._mixed_precision_enabled_for_params`, `FullyShardedDataParallel._mixed_precision_enabled_for_reduce`, and `FullyShardedDataParallel._mixed_precision_keep_low_precision_grads`, since they are not used.
- `test_meta_device_with_mixed_precision()` exercises a tricky edge case with meta device initialization, `apply()` (calling into `summon_full_params()`), and `param_dtype=torch.float32` for a nested wrapping case, where each nested instance has parameters.

Follow-Ups
- Remove `HandleConfig` and store its fields as attributes on `FlatParamHandle` directly.
- Rename `keep_low_precision_grads` to `keep_grads_in_reduce_dtype`.
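As a rough before/after sketch of the first follow-up (the `HandleConfig` fields and class names shown are illustrative, not the exact definitions in the codebase):

```python
from dataclasses import dataclass
import torch

# Before: a small config object stored on the handle.
@dataclass
class HandleConfig:  # illustrative subset of fields
    fwd_bwd_param_dtype: torch.dtype
    reduce_dtype: torch.dtype
    keep_low_precision_grads: bool

class FlatParamHandleBefore:
    def __init__(self, config: HandleConfig):
        self._config = config  # accessed as handle._config.reduce_dtype, etc.

# After (proposed follow-up): store the same fields directly on the handle.
class FlatParamHandleAfter:
    def __init__(
        self,
        fwd_bwd_param_dtype: torch.dtype,
        reduce_dtype: torch.dtype,
        keep_low_precision_grads: bool,
    ):
        self._fwd_bwd_param_dtype = fwd_bwd_param_dtype  # handle._fwd_bwd_param_dtype
        self._reduce_dtype = reduce_dtype                # handle._reduce_dtype
        self._keep_low_precision_grads = keep_low_precision_grads
```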