[FSDP()][2/N] Refactor training state #87916
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87916
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 7decb7b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
from enum import auto, Enum


class TrainingState(Enum):
Since this is in a private file, we do not need to make this private. I should do this more consistently (e.g. some later PRs in this stack will violate this), but I will leave that as BE follow-ups.
# Import the entire FSDP file to avoid circular imports
- import torch.distributed.fsdp.fully_sharded_data_parallel as FSDP
+ import torch.distributed.fsdp.fully_sharded_data_parallel as fsdp_file
Same as the previous PR: Rename FSDP to fsdp_file to avoid confusion since we sometimes import FullyShardedDataParallel as FSDP.
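For illustration, a minimal sketch of the two import styles being contrasted: the module import mirrors the diff above, while the commented-out class import is the conventional user-facing `FSDP` alias mentioned in the comment.

```python
# Importing the module (rather than the class) defers attribute lookups to the
# point of use, which is what avoids the circular import noted in the diff.
import torch.distributed.fsdp.fully_sharded_data_parallel as fsdp_file

# The class import is conventionally aliased as `FSDP`, which is why reusing
# `FSDP` as the module alias was confusing:
# from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
```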
This PR actually has meaningful changes. We stratify `TrainingState` into two levels: one is per FSDP instance and one is per `FlatParamHandle`/`FlatParameter`.
- At the FSDP instance level, we only care about `IDLE`, FSDP computation (i.e. `FORWARD_BACKWARD`), or `SUMMON_FULL_PARAMS`. These dynamically modify behavior (e.g. `summon_full_params()` forces full precision).
- At the `FlatParamHandle` level, we care about the training state for invariants and debugging. Hence, we keep `IDLE`, `FORWARD`, `BACKWARD_PRE`, `BACKWARD_POST`, and `SUMMON_FULL_PARAMS`.
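As a concrete reference for the split described above, here is a minimal sketch of the two enums; the member lists follow the description, but the exact class layout and file placement in the PR may differ.

```python
from enum import Enum, auto


class TrainingState(Enum):
    """Per-FSDP-instance state, used to dynamically modify behavior
    (e.g. summon_full_params() forces full precision)."""
    IDLE = auto()
    FORWARD_BACKWARD = auto()
    SUMMON_FULL_PARAMS = auto()


class HandleTrainingState(Enum):
    """Per-FlatParamHandle/FlatParameter state, kept for invariant
    checks and debugging."""
    IDLE = auto()
    FORWARD = auto()
    BACKWARD_PRE = auto()
    BACKWARD_POST = auto()
    SUMMON_FULL_PARAMS = auto()
```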
mrshenli left a comment:
naming-only changes. LGTM
@pytorchbot merge -g
Merge started. Your change will be merged once all checks on your PR pass since you used the green (-g) flag (ETA: 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
ghstack-source-id: e01809e Pull Request resolved: pytorch#87916
@@ -0,0 +1,23 @@
from enum import auto, Enum
I am maintaining this _common_utils.py as I refactor. Eventually, we will merge _utils.py into _common_utils.py or other files.
The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.
@pytorchbot merge -g
Merge started. Your change will be merged once all checks on your PR pass since you used the green (-g) flag (ETA: 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hey @awgu. |
- self._assert_state(
-     [TrainingState_.BACKWARD_PRE, TrainingState_.BACKWARD_POST]
+ self._assert_state([TrainingState.FORWARD_BACKWARD])
+ self.training_state = TrainingState.FORWARD_BACKWARD
nit: why is it assigned the same state right after asserting that the state is already TrainingState.FORWARD_BACKWARD?
Good point. I refactored too fast and overlooked this redundancy :)
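For reference, a small self-contained sketch of the redundancy under discussion; `_assert_state` is written out here as a simplified stand-in for FSDP's internal helper, and the class and method names are hypothetical.

```python
from enum import Enum, auto
from typing import List


class TrainingState(Enum):
    IDLE = auto()
    FORWARD_BACKWARD = auto()
    SUMMON_FULL_PARAMS = auto()


class _Sketch:
    def __init__(self) -> None:
        self.training_state = TrainingState.FORWARD_BACKWARD

    def _assert_state(self, states: List[TrainingState]) -> None:
        # Simplified stand-in: raise if the current state is unexpected.
        assert self.training_state in states, self.training_state

    def _during_backward(self) -> None:
        self._assert_state([TrainingState.FORWARD_BACKWARD])
        # Redundant: the assertion above already guarantees this value.
        self.training_state = TrainingState.FORWARD_BACKWARD


_Sketch()._during_backward()
```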
self._assert_state([TrainingState.FORWARD_BACKWARD])
self.training_state = TrainingState.FORWARD_BACKWARD
p_assert(
    handle._training_state == HandleTrainingState.BACKWARD_PRE,
Nice! Since it is per-handle state, there is no need to check BACKWARD_POST anymore, which is much cleaner.
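A hedged sketch of the per-handle check: `p_assert` is written out as a simple assert-with-message stand-in for FSDP's internal helper, and the mock handle and hook function are hypothetical.

```python
from enum import Enum, auto


class HandleTrainingState(Enum):
    IDLE = auto()
    FORWARD = auto()
    BACKWARD_PRE = auto()
    BACKWARD_POST = auto()
    SUMMON_FULL_PARAMS = auto()


def p_assert(cond: bool, msg: str) -> None:
    # Stand-in for FSDP's internal assertion helper.
    if not cond:
        raise AssertionError(msg)


class MockHandle:
    def __init__(self) -> None:
        self._training_state = HandleTrainingState.BACKWARD_PRE


def pre_backward_check(handle: MockHandle) -> None:
    # Because the state is tracked per handle, only BACKWARD_PRE needs to be
    # accepted here; the pre-refactor code also had to allow BACKWARD_POST.
    p_assert(
        handle._training_state == HandleTrainingState.BACKWARD_PRE,
        f"Expected BACKWARD_PRE, got {handle._training_state}",
    )


pre_backward_check(MockHandle())
```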
This PR actually has meaningful changes. We stratify `TrainingState` into two levels: one is per FSDP instance and one is per `FlatParamHandle`/`FlatParameter`.
- At the FSDP instance level, we only care about `IDLE`, FSDP computation (i.e. `FORWARD_BACKWARD`), or `SUMMON_FULL_PARAMS`. These dynamically modify behavior (e.g. `summon_full_params()` forces full precision).
- At the `FlatParamHandle` level, we care about the training state for invariants and debugging. Hence, we keep `IDLE`, `FORWARD`, `BACKWARD_PRE`, `BACKWARD_POST`, and `SUMMON_FULL_PARAMS`.
Pull Request resolved: pytorch#87916
Approved by: https://github.com/mrshenli
Stack from ghstack:
- #87941 [FSDP()][26/N] Move `_lazy_init()` into `_fsdp_root_pre_forward()`
- #87940 [FSDP()][25/N] Add `_post_forward_reshard()`
- #87939 [FSDP()][24/N] Refactor `_lazy_init()`
- #87937 [FSDP] Simplify `_reset_lazy_init()`
- #87936 [FSDP()][22/N] Refactor `_cast_buffers()` in `_lazy_init()`
- #87935 [FSDP()][21/N] Refactor `_buffer_name_to_orig_dtype` computation
- #87934 [FSDP] Rename `dtype` to `buffer_name_to_dtype`
- #87933 [FSDP] Remove `device` arg from `_cast_buffers()`
- #87931 [FSDP()][18/N] Refactor `pre_forward_unshard()`
- #87930 [FSDP()][17/N] Refactor `_fsdp_root_pre_forward()`
- #87928 [FSDP()][15/N] Refactor `_init_streams()`
- #87922 [FSDP()][8/N] Refactor limiter's `_FreeEventQueue`
- #87920 [FSDP()][6/N] Refactor `CPUOffload` dataclass
- #87919 [FSDP()][5/N] Refactor `MixedPrecision` dataclass
- #87918 [FSDP()][4/N] Refactor `ShardingStrategy` enum
- Refactor `BackwardPrefetch` enum

This PR actually has meaningful changes. We stratify `TrainingState` into two levels: one is per FSDP instance and one is per `FlatParamHandle`/`FlatParameter`.
- At the FSDP instance level, we only care about `IDLE`, FSDP computation (i.e. `FORWARD_BACKWARD`), or `SUMMON_FULL_PARAMS`. These dynamically modify behavior (e.g. `summon_full_params()` forces full precision).
- At the `FlatParamHandle` level, we care about the training state for invariants and debugging. Hence, we keep `IDLE`, `FORWARD`, `BACKWARD_PRE`, `BACKWARD_POST`, and `SUMMON_FULL_PARAMS`.
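To make the two-level split concrete, here is a hedged, self-contained sketch (mock classes, not FSDP's real ones) of how the coarse instance-level state and the fine-grained handle-level states could evolve over one iteration; the enum members repeat the earlier sketch so the snippet runs on its own.

```python
from enum import Enum, auto
from typing import List


class TrainingState(Enum):
    IDLE = auto()
    FORWARD_BACKWARD = auto()
    SUMMON_FULL_PARAMS = auto()


class HandleTrainingState(Enum):
    IDLE = auto()
    FORWARD = auto()
    BACKWARD_PRE = auto()
    BACKWARD_POST = auto()
    SUMMON_FULL_PARAMS = auto()


class MockHandle:
    def __init__(self) -> None:
        self._training_state = HandleTrainingState.IDLE


class MockFSDPInstance:
    def __init__(self, handles: List[MockHandle]) -> None:
        self.training_state = TrainingState.IDLE
        self._handles = handles

    def run_iteration(self) -> None:
        # One coarse state covers the whole FSDP computation...
        self.training_state = TrainingState.FORWARD_BACKWARD
        for phase in (
            HandleTrainingState.FORWARD,
            HandleTrainingState.BACKWARD_PRE,
            HandleTrainingState.BACKWARD_POST,
            HandleTrainingState.IDLE,
        ):
            # ...while each handle records the fine-grained phase that
            # invariant checks and debugging rely on.
            for handle in self._handles:
                handle._training_state = phase
        self.training_state = TrainingState.IDLE


# Usage: two handles under one mock FSDP instance.
instance = MockFSDPInstance([MockHandle(), MockHandle()])
instance.run_iteration()
assert instance.training_state is TrainingState.IDLE
```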