Conversation

@awgu (Collaborator) commented on Oct 27, 2022

Stack from ghstack:

This PR actually has meaningful changes. We stratify `TrainingState` into two levels: one is per FSDP instance and one is per `FlatParamHandle`/`FlatParameter`.

  • At the FSDP instance level, we only care about `IDLE`, FSDP computation (i.e. `FORWARD_BACKWARD`), or `SUMMON_FULL_PARAMS`. These dynamically modify behavior (e.g. `summon_full_params()` forces full precision).
  • At the `FlatParamHandle` level, we care about the training state for invariants and debugging. Hence, we keep `IDLE`, `FORWARD`, `BACKWARD_PRE`, `BACKWARD_POST`, and `SUMMON_FULL_PARAMS`.
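For concreteness, a minimal sketch of the two enums implied by this split. The member names are taken from the bullets above; the exact definitions, docstrings, and file placement in the FSDP code may differ.

from enum import auto, Enum


class TrainingState(Enum):
    """Per-FSDP-instance state that dynamically modifies behavior."""

    IDLE = auto()
    FORWARD_BACKWARD = auto()  # any FSDP computation (forward or backward)
    SUMMON_FULL_PARAMS = auto()  # e.g. summon_full_params() forces full precision


class HandleTrainingState(Enum):
    """Per-FlatParamHandle state kept for invariant checks and debugging."""

    IDLE = auto()
    FORWARD = auto()
    BACKWARD_PRE = auto()
    BACKWARD_POST = auto()
    SUMMON_FULL_PARAMS = auto()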

@pytorch-bot (bot) commented on Oct 27, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87916

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7decb7b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

from enum import auto, Enum


class TrainingState(Enum):
@awgu (Collaborator, Author) commented:

Since this is in a private file, we do not need to make this private. I should do this more consistently (e.g. some later PRs in this stack will violate this), but I will leave that as BE follow-ups.


# Import the entire FSDP file to avoid circular imports
import torch.distributed.fsdp.fully_sharded_data_parallel as FSDP
import torch.distributed.fsdp.fully_sharded_data_parallel as fsdp_file
@awgu (Collaborator, Author) commented:

Same as the previous PR: Rename FSDP to fsdp_file to avoid confusion since we sometimes import FullyShardedDataParallel as FSDP.
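To illustrate the naming collision, here are the two import styles side by side. This is a sketch rather than a quote from the FSDP source:

# The class import is commonly aliased as `FSDP`:
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# The whole-module import (used to avoid circular imports) therefore gets a
# distinct alias so it cannot be mistaken for the class alias above:
import torch.distributed.fsdp.fully_sharded_data_parallel as fsdp_file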

Andrew Gu added 2 commits October 28, 2022 04:17
@mrshenli (Contributor) left a comment:

naming-only changes. LGTM

@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Oct 28, 2022
@mrshenli (Contributor) commented:
@pytorchbot merge -g

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks on your PR pass since you used the green (-g) flag (ETA: 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

awgu pushed a commit to awgu/pytorch that referenced this pull request Oct 28, 2022
ghstack-source-id: e01809e
Pull Request resolved: pytorch#87916
@@ -0,0 +1,23 @@
from enum import auto, Enum
@awgu (Collaborator, Author) commented:

I am maintaining this _common_utils.py as I refactor. Eventually, we will merge _utils.py into _common_utils.py or other files.

@pytorchmergebot (Collaborator) commented:
The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

@awgu (Collaborator, Author) commented on Oct 29, 2022

@pytorchbot merge -g

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks on your PR pass since you used the green (-g) flag (ETA: 0-4 Hours).


@github-actions (Contributor) commented:
Hey @awgu.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

self._assert_state(
    [TrainingState_.BACKWARD_PRE, TrainingState_.BACKWARD_POST]
)
self._assert_state([TrainingState.FORWARD_BACKWARD])
self.training_state = TrainingState.FORWARD_BACKWARD
Contributor commented:

nit: why is it assigned the same state right after checking that the state == TrainingState.FORWARD_BACKWARD?

@awgu (Collaborator, Author) replied:

Good point. I refactored too fast and overlooked this redundancy :)
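To make the nit concrete, here is a hypothetical stand-in for the flagged lines. The class and the `_assert_state` helper below are made up for illustration and are not the actual FSDP internals:

from enum import auto, Enum


class TrainingState(Enum):
    IDLE = auto()
    FORWARD_BACKWARD = auto()
    SUMMON_FULL_PARAMS = auto()


class _FsdpInstanceSketch:
    """Made-up stand-in for the FSDP instance-level bookkeeping."""

    def __init__(self) -> None:
        self.training_state = TrainingState.FORWARD_BACKWARD

    def _assert_state(self, states):
        assert self.training_state in states, self.training_state

    def _on_backward(self) -> None:
        self._assert_state([TrainingState.FORWARD_BACKWARD])
        # Redundant: the assertion above already guarantees this exact value,
        # so this reassignment can simply be deleted.
        self.training_state = TrainingState.FORWARD_BACKWARD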

self._assert_state([TrainingState.FORWARD_BACKWARD])
self.training_state = TrainingState.FORWARD_BACKWARD
p_assert(
handle._training_state == HandleTrainingState.BACKWARD_PRE,
Contributor commented:

nice! since it is per handle state, no need to check BACKWARD_POST any more, which is much cleaner
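One plausible reading of why the per-handle state allows the stricter check, sketched with made-up classes rather than the actual FSDP hooks: a single shared instance-level state is advanced to `BACKWARD_POST` by the first parameter's post-backward hook, so later hooks had to accept both states, whereas each handle's own post-backward hook runs while that handle is still in `BACKWARD_PRE`.

from enum import auto, Enum


class HandleTrainingState(Enum):
    IDLE = auto()
    FORWARD = auto()
    BACKWARD_PRE = auto()
    BACKWARD_POST = auto()
    SUMMON_FULL_PARAMS = auto()


class _HandleSketch:
    """Made-up stand-in for a FlatParamHandle's state bookkeeping."""

    def __init__(self) -> None:
        self._training_state = HandleTrainingState.IDLE

    def pre_backward(self) -> None:
        self._training_state = HandleTrainingState.BACKWARD_PRE

    def post_backward(self) -> None:
        # Each handle's post-backward hook runs once, so the handle must still
        # be in BACKWARD_PRE here; no need to also allow BACKWARD_POST.
        assert self._training_state == HandleTrainingState.BACKWARD_PRE
        self._training_state = HandleTrainingState.BACKWARD_POST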

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022
Pull Request resolved: pytorch#87916
Approved by: https://github.com/mrshenli
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Pull Request resolved: pytorch#87916
Approved by: https://github.com/mrshenli
@facebook-github-bot deleted the gh/awgu/146/head branch on June 8, 2023 15:23

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (fsdp) (release notes category)


5 participants