Skip to content

Conversation

@fegin
Copy link
Contributor

@fegin fegin commented Jan 30, 2023

Stack from ghstack (oldest at bottom):

When a rank does not own a parameter (parameter.numel() == 0), its optim state is not valid and should not be checked against the current saved one.

Differential Revision: D42865237

… own the corresponding parameter

When a rank does not own a parameter (parameter.numel() == 0), its optim state is not valid and should not be checked against the current saved one.

Differential Revision: [D42865237](https://our.internmc.facebook.com/intern/diff/D42865237/)

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Jan 30, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/93318

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1c9da19:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Jan 30, 2023
fegin added a commit that referenced this pull request Jan 30, 2023
… own the corresponding parameter

When a rank does not own a parameter (parameter.numel() == 0), its optim state is not valid and should not be checked against the current saved one.

Differential Revision: [D42865237](https://our.internmc.facebook.com/intern/diff/D42865237/)

ghstack-source-id: 178833186
Pull Request resolved: #93318
…at does not own the corresponding parameter"

When a rank does not own a parameter (parameter.numel() == 0), its optim state is not valid and should not be checked against the current saved one.

Differential Revision: [D42865237](https://our.internmc.facebook.com/intern/diff/D42865237/)

[ghstack-poisoned]
fegin added a commit that referenced this pull request Jan 31, 2023
… own the corresponding parameter

Pull Request resolved: #93318

When a rank does not own a parameter (parameter.numel() == 0), its optim state is not valid and should not be checked against the current saved one.
ghstack-source-id: 178911152

Differential Revision: [D42865237](https://our.internmc.facebook.com/intern/diff/D42865237/)
Copy link
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, but could we add an example unittest that would've raised an error / correctness issue that this change would fix?

local_state = (
optim_state[name]
if name in optim_state
else torch.empty(max_numel, dtype=dtype, device=fsdp_state.compute_device)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid lint?

…at does not own the corresponding parameter"

When a rank does not own a parameter (parameter.numel() == 0), its optim state is not valid and should not be checked against the current saved one.

Differential Revision: [D42865237](https://our.internmc.facebook.com/intern/diff/D42865237/)

[ghstack-poisoned]
…at does not own the corresponding parameter"

When a rank does not own a parameter (parameter.numel() == 0), its optim state is not valid and should not be checked against the current saved one.

Differential Revision: [D42865237](https://our.internmc.facebook.com/intern/diff/D42865237/)

[ghstack-poisoned]
fegin added a commit that referenced this pull request Feb 1, 2023
… own the corresponding parameter

Pull Request resolved: #93318

When a rank does not own a parameter (parameter.numel() == 0), its optim state is not valid and should not be checked against the current saved one.
ghstack-source-id: 179074159

Differential Revision: [D42865237](https://our.internmc.facebook.com/intern/diff/D42865237/)
…at does not own the corresponding parameter"

When a rank does not own a parameter (parameter.numel() == 0), its optim state is not valid and should not be checked against the current saved one.

Differential Revision: [D42865237](https://our.internmc.facebook.com/intern/diff/D42865237/)

[ghstack-poisoned]
fegin added a commit that referenced this pull request Feb 2, 2023
… own the corresponding parameter

Pull Request resolved: #93318

When a rank does not own a parameter (parameter.numel() == 0), its optim state is not valid and should not be checked against the current saved one.
ghstack-source-id: 179122260

Differential Revision: [D42865237](https://our.internmc.facebook.com/intern/diff/D42865237/)
@fegin fegin added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 2, 2023
@fegin
Copy link
Contributor Author

fegin commented Feb 2, 2023

@pytorchbot merge

@pytorch-bot
Copy link

pytorch-bot bot commented Feb 2, 2023

This PR needs to be approved by an authorized maintainer before merge.

@rohan-varma rohan-varma self-requested a review February 3, 2023 00:17
Copy link
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stamp to unblock but would be great if we can add the unittest or file an issue to do so.

@fegin
Copy link
Contributor Author

fegin commented Feb 3, 2023

stamp to unblock but would be great if we can add the unittest or file an issue to do so.

Oh, this PR already has a UT for that.

@fegin
Copy link
Contributor Author

fegin commented Feb 3, 2023

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@facebook-github-bot facebook-github-bot deleted the gh/fegin/67/head branch June 8, 2023 17:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (fsdp) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants