@awgu (Collaborator) commented Oct 27, 2022

Stack from ghstack:

This PR refactors and fixes `_cast_buffers()`.

**Before**
Buffers were not correctly cast back to their original dtypes for submodules when using buffer mixed precision.

- `_cast_buffers(recurse=False)` incorrectly casts all buffers, including those in submodules, because of this outer loop over `self.modules()` (see the sketch after this list):
  https://github.com/pytorch/pytorch/blob/c40033be162db0f94d37e7ccbd2a89d67f8b8e47/torch/distributed/fsdp/fully_sharded_data_parallel.py#L700
- There was a unit test that checked that buffers were cast as expected (`test_mixed_precision_e2e_full_shard()`). The unit test coincidentally passed because all modules shared the same buffer name `"buffer"`. In `_cast_buffers()`, the `dict` mapping buffer name to original dtype is populated lazily (during `_lazy_init()`). However, the keys are unprefixed:
  https://github.com/pytorch/pytorch/blob/c40033be162db0f94d37e7ccbd2a89d67f8b8e47/torch/distributed/fsdp/fully_sharded_data_parallel.py#L712-L717

      for name, buf in module.named_buffers(recurse=False):
          if buf is None:
              continue
          buf = buf.to(device=device or self.compute_device)
          if name not in self._buffer_name_to_orig_dtype:
              self._buffer_name_to_orig_dtype[name] = buf.dtype
- Thus, even though (1) `_cast_buffers(recurse=False)` was only called on the root and (2) `self._buffer_name_to_orig_dtype` had unprefixed names as keys, the unit test still passed because (1) `_cast_buffers()` still looped over all buffers despite `recurse=False` and (2) all submodules' buffers were named `"buffer"` and had the same original and low-precision dtypes and hence were cast correctly.
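To make the failure mode above concrete, here is a minimal standalone sketch (a toy two-module model, not the FSDP implementation) showing why the outer loop over `self.modules()` defeats `recurse=False` and why the unprefixed buffer names collide:

```python
# Toy sketch (not the FSDP code itself): root.modules() yields the root AND
# every submodule, so calling named_buffers(recurse=False) inside that loop
# still visits every buffer in the model, and the unprefixed names collide.
import torch
import torch.nn as nn

class Sub(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("buffer", torch.zeros(1))

class Root(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("buffer", torch.ones(1))
        self.sub = Sub()

root = Root()
seen = []
for module in root.modules():  # outer loop: Root and Sub
    for name, buf in module.named_buffers(recurse=False):  # unprefixed names
        seen.append(name)
print(seen)  # ['buffer', 'buffer'] -- all buffers are visited, names collide
```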

If we change each submodule to have its own distinct buffer name, the unit test fails. This PR makes that change to showcase the fix (see the sketch below).
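For illustration only, a hypothetical sketch of the kind of change meant here; the actual model used by `test_mixed_precision_e2e_full_shard()` may be structured differently. Giving each submodule a uniquely named buffer means the unprefixed-key bookkeeping can no longer pass by coincidence:

```python
# Hypothetical example of distinct per-submodule buffer names; the real unit
# test model may differ.
import torch
import torch.nn as nn

class LinearWithBuffer(nn.Module):
    def __init__(self, index: int):
        super().__init__()
        self.lin = nn.Linear(4, 4)
        self.register_buffer(f"buffer_{index}", torch.randn(4))  # distinct name

model = nn.Sequential(*[LinearWithBuffer(i) for i in range(3)])
print([name for name, _ in model.named_buffers()])
# ['0.buffer_0', '1.buffer_1', '2.buffer_2']
```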

**After**
This PR separates `_cast_buffers()` into three methods: `_get_buffers_and_dtypes_for_computation()`, `_get_buffers_and_dtypes_for_checkpoint()`, and `_cast_buffers_to_dtype_and_device()`. This separates the different use cases (casting for computation and casting for checkpointing) and their corresponding code paths. Plus, the signature of `_cast_buffers_to_dtype_and_device()` makes it clear exactly which buffers are being cast and to what dtype.

Both `_get_...()` functions assume that they are called on the root only for now. This coincides with the construction of `_buffer_name_to_orig_dtype` in the FSDP constructor, which loops over all submodules. (This means that for non-root modules, their `_buffer_name_to_orig_dtype` is populated but not used.) The `dict`'s keys are clean since the cast of buffers back to their original dtype happens inside a `summon_full_params()` context, which cleans the names.
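As a rough illustration of the intended separation: the method names above come from this PR, but the signature and call sites sketched below are assumptions for illustration, not the actual FSDP implementation.

```python
# Hedged sketch: pairing each buffer with an explicit target dtype is what
# makes the cast unambiguous; None means "keep the buffer's current dtype".
# The signature and the commented call sites are assumptions, not real code.
from typing import Iterable, Optional
import torch

def _cast_buffers_to_dtype_and_device(
    buffers: Iterable[torch.Tensor],
    buffer_dtypes: Iterable[Optional[torch.dtype]],
    device: torch.device,
) -> None:
    for buffer, dtype in zip(buffers, buffer_dtypes):
        buffer.data = buffer.data.to(device=device, dtype=dtype or buffer.dtype)

# Assumed call sites:
#   buffers, dtypes = _get_buffers_and_dtypes_for_computation(root_module)
#   _cast_buffers_to_dtype_and_device(buffers, dtypes, compute_device)
# and, before saving a state dict:
#   buffers, dtypes = _get_buffers_and_dtypes_for_checkpoint(root_module)
#   _cast_buffers_to_dtype_and_device(buffers, dtypes, compute_device)
```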

**Follow-Ups**

- We can try to move `_get_buffers_and_dtypes_for_checkpoint()` into `_state_dict_utils.py`.
- We may want to move to per-module buffer casting (i.e., not have the root module cast buffers for all submodules).

@pytorch-bot (bot) commented Oct 27, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87935

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 88d3c09:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@awgu changed the title from "[FSDP()][21/N] Refactor _buffer_name_to_orig_dtype computation" to "[FSDP()][21/N] Refactor and fix _cast_buffers()" on Nov 1, 2022
awgu pushed a commit to awgu/pytorch that referenced this pull request Nov 1, 2022
ghstack-source-id: 6d46f28
Pull Request resolved: pytorch#87935
awgu pushed a commit to awgu/pytorch that referenced this pull request Nov 2, 2022
ghstack-source-id: 0633b82
Pull Request resolved: pytorch#87935
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022
Pull Request resolved: pytorch#87935
Approved by: https://github.com/mrshenli
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
@facebook-github-bot deleted the gh/awgu/165/head branch on June 8, 2023

Labels

ciflow/trunk, release notes: distributed (fsdp), topic: not user facing
