[FSDP()][27/N] Add forward hook registration #88040
Conversation
Dr. CI status: ✅ No failures as of commit d930488. See artifacts and rendered test results at hud.pytorch.org/pr/88040.
rohan-varma left a comment:
couple of minor questions, thanks for adding this!
    def test_training(self):
        """Tests training (forward, backward, optimizer)."""
        device = torch.device("cuda")
        local_model = Model(device=device)
do we have support for composable FSDP + meta device? Is there a source of truth where we can find the feature set covered by composable?
> do we have support for composable FSDP + meta device?

I believe there should be, because the composable FSDP constructor includes the same module materialization logic as the normal FSDP constructor (see the sketch after this comment).

> Is there a source of truth where we can find the feature set covered by composable?

This is difficult to document right now since we are still prototyping. As I continue testing and thinking about the design, I may realize some sharp edges that prevent feature parity. I will try to stabilize soon (the same applies to use_orig_params=True).
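For concreteness, here is a minimal sketch of composable FSDP with a meta-device module. It is not from this PR: it assumes fully_shard() accepts the same param_init_fn materialization argument as the FullyShardedDataParallel constructor, that a process group is already initialized (e.g., via torchrun), and that a CUDA device is available; the Model class and _init_fn are hypothetical.

```python
# Hedged sketch: composable FSDP applied to a meta-device module.
# Assumptions (not confirmed by this PR): fully_shard() takes param_init_fn like
# the FSDP wrapper does, and torch.distributed is already initialized.
import torch
import torch.nn as nn
from torch.distributed._composable import fully_shard  # prototype API at the time of this PR


class Model(nn.Module):  # hypothetical toy model, mirroring the test's Model(device=...)
    def __init__(self, device=None):
        super().__init__()
        self.lin1 = nn.Linear(16, 16, device=device)
        self.lin2 = nn.Linear(16, 4, device=device)

    def forward(self, x):
        return self.lin2(torch.relu(self.lin1(x)))


def _init_fn(module: nn.Module) -> None:
    # Materialize meta-device storage on GPU, then reinitialize the weights.
    module.to_empty(device=torch.device("cuda"))
    for p in module.parameters():
        if p.dim() > 1:
            nn.init.kaiming_uniform_(p)
        else:
            nn.init.zeros_(p)


meta_model = Model(device=torch.device("meta"))      # no real storage allocated yet
fully_shard(meta_model, param_init_fn=_init_fn)      # materializes, then shards in place
out = meta_model(torch.randn(2, 16, device="cuda"))  # forward hooks drive unshard/reshard
```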
     elif _handles_key:
-        _assert_in_training_states(state, [TrainingState.IDLE])
+        allowed_states = [TrainingState.IDLE]
+        if _is_composable(state):
why are allowed states different in composable vs non-composable?
First, note that this is TrainingState, which is per state: _FSDPState object, not HandleTrainingState, which is per FlatParamHandle / FlatParameter. (_FSDPState is Union[_State, FullyShardedDataParallel], where _State comes from torch/distributed/_composable/contract.py.)
For composable, state represents the local FSDP root (without wrapping). Upon the first FlatParameter's pre-backward hook, the state transitions to FORWARD_BACKWARD. For any subsequent FlatParameter's pre-backward hooks, state.training_state will already be FORWARD_BACKWARD. However, each FlatParamHandle's training state transitions as you would expect (i.e., IDLE -> BACKWARD_PRE here).
This is why I had to refactor TrainingState in an earlier PR: we have to stratify the training states to accommodate the difference between state: _FSDPState and FlatParamHandle / FlatParameter.
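To make the two levels of state concrete, here is a self-contained toy sketch. It is not the FSDP source; the class names and transitions are simplified from the explanation above. It shows why a composable state object can already be in FORWARD_BACKWARD when a later handle's pre-backward hook fires, while each handle itself still transitions IDLE -> BACKWARD_PRE.

```python
# Toy illustration (not FSDP internals) of per-state vs. per-handle training states.
from enum import Enum, auto


class TrainingState(Enum):        # per _FSDPState (one per local FSDP root for composable)
    IDLE = auto()
    FORWARD_BACKWARD = auto()


class HandleTrainingState(Enum):  # per FlatParamHandle / FlatParameter
    IDLE = auto()
    BACKWARD_PRE = auto()


class ToyState:
    def __init__(self, num_handles: int):
        self.training_state = TrainingState.IDLE
        self.handle_states = [HandleTrainingState.IDLE] * num_handles


def pre_backward_hook(state: ToyState, handle_idx: int, composable: bool) -> None:
    # For composable FSDP, a later handle's hook may fire while the shared state
    # is already in FORWARD_BACKWARD, so that state must also be allowed.
    allowed = {TrainingState.IDLE}
    if composable:
        allowed.add(TrainingState.FORWARD_BACKWARD)
    assert state.training_state in allowed, state.training_state
    state.training_state = TrainingState.FORWARD_BACKWARD
    # The handle-level transition matches the usual expectation.
    assert state.handle_states[handle_idx] is HandleTrainingState.IDLE
    state.handle_states[handle_idx] = HandleTrainingState.BACKWARD_PRE


state = ToyState(num_handles=2)
pre_backward_hook(state, 0, composable=True)  # state: IDLE -> FORWARD_BACKWARD
pre_backward_hook(state, 1, composable=True)  # state already FORWARD_BACKWARD; still allowed
```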
This PR adds the forward hook registration to composable FSDP and adds a unit test for the runtime.
Pull Request resolved: pytorch#88040
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma
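Since the PR description is terse, here is a hedged sketch of the general idea behind "forward hook registration" (this is not the FSDP implementation, and the hook bodies are placeholders): because composable FSDP does not wrap the module, it presumably cannot override nn.Module.forward the way the FullyShardedDataParallel wrapper does, so the unshard/reshard logic is attached through the standard nn.Module hook APIs.

```python
# Sketch only: shows the nn.Module hook APIs that such a registration would use.
import torch
import torch.nn as nn


def _pre_forward(module: nn.Module, args):
    # Placeholder for: unshard parameters, wait on the all-gather stream, etc.
    print(f"pre-forward on {type(module).__name__}")
    return args  # returning the same tuple leaves the inputs unchanged


def _post_forward(module: nn.Module, args, output):
    # Placeholder for: reshard parameters, register pre-backward hooks on outputs.
    print(f"post-forward on {type(module).__name__}")
    return output


model = nn.Linear(4, 4)
pre_handle = model.register_forward_pre_hook(_pre_forward)
post_handle = model.register_forward_hook(_post_forward)
_ = model(torch.randn(2, 4))  # triggers both hooks around forward()
pre_handle.remove()
post_handle.remove()
```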
Stack from ghstack:
- #88260 [FSDP()][Easy] Make fully_shard() only FULL_SHARD
- #88235 [FSDP()] Have fully_shard() abide by @contract!
- #88234 [FSDP()][Easy] Rename _State to _FSDPState
- #88233 [FSDP()] Rename to fully_shard() and move to _composable/
- #88232 [FSDP][Easy] Remove unneeded TrainingState transition
- #88123 [FSDP] Rename unflat_param_name -> fqn for consistency
- #88122 [FSDP] Simplify _get_buffer_names()
- #88121 [FSDP] Remove unneeded torch.no_grad() context when offloading to CPU
- #87941 [FSDP()][26/N] Move _lazy_init() into _fsdp_root_pre_forward()
- #87940 [FSDP()][25/N] Add _post_forward_reshard()
- #87939 [FSDP()][24/N] Refactor _lazy_init()
- #87935 [FSDP()][21/N] Refactor and fix _cast_buffers()
- #87934 [FSDP] Rename dtype to buffer_name_to_dtype
- #87933 [FSDP] Remove device arg from _cast_buffers()
- #87931 [FSDP()][18/N] Refactor pre_forward_unshard()
- #87930 [FSDP()][17/N] Refactor _fsdp_root_pre_forward()
- #87928 [FSDP()][15/N] Refactor _init_streams()