
Conversation


@janeyx99 janeyx99 commented Mar 2, 2023

Big OOPS correction. Also added a test this time to verify the defaulting works as expected.

We've claimed (and intended) to default to foreach wherever we can, but today we realized we had missed two critical groups (the biggest ones, covering essentially all real use cases):

  1. Models. We forgot to count nn.Parameter among the natively supported tensor types, so foreach only defaulted for test cases and for people who didn't use models (a sketch of the type check follows this list).
  2. We previously required that ALL relevant tensors be on CUDA before flipping the switch. Almost all state_steps, however, live on CPU the whole time, so the great majority of real runs never flipped to foreach.
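To make the first point concrete, here is a minimal sketch of the kind of supported-type check involved. The names `_foreach_supported_types` and `_all_tensors_supported` are illustrative assumptions, not necessarily PyTorch's exact internals:

```python
import torch
import torch.nn as nn

# Illustrative sketch (names are assumptions, not necessarily PyTorch's
# exact internals). An exact type() check is used rather than isinstance()
# because arbitrary Tensor subclasses may override ops and should not be
# silently batched by the foreach kernels.
_foreach_supported_types = [torch.Tensor, nn.Parameter]  # the fix: include nn.Parameter

def _all_tensors_supported(tensors):
    return all(type(t) in _foreach_supported_types for t in tensors)

params = [nn.Parameter(torch.randn(2, 2)) for _ in range(3)]
# Before the fix the list held only torch.Tensor, so every model
# parameter failed this check and foreach never defaulted on.
print(_all_tensors_supported(params))  # True with nn.Parameter included
```

Since nn.Parameter is what `model.parameters()` yields, the exact-type check is why anyone optimizing a real model fell through to the slower single-tensor path.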


janeyx99 added 2 commits March 2, 2023 00:05
[optim] include nn.Parameter as foreach supported (pytorch#95811)

This PR is the result of realizing that models were NOT subscribed to the foreach defaulting, despite what our documentation has claimed for months now. BIG OOPS.

Pull Request resolved: pytorch#95811
Approved by: https://github.com/albanD
[optim] Widen the cases for defaulting to foreach (pytorch#95820)

Big OOPS correction, continued. Also added a test this time to verify the defaulting works as expected.

The key here is realizing that the grouping for foreach already assumes that the non-param tensorlists follow the params in dtype and device, so it was too narrow to check that _all_ tensors were on CUDA. The main leeway this allowed was for state_steps, which are sometimes CPU tensors. Since foreach _can_ handle CPU tensors, this should not introduce breakage (see the sketch below).

Pull Request resolved: pytorch#95820
Approved by: https://github.com/albanD
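A hedged sketch of the widened device check described above (the helper name and signature are hypothetical, not PyTorch's actual internals): only the params need to be on CUDA, and CPU state_steps no longer block the default.

```python
import torch

def _should_default_to_foreach(params, state_steps):
    # Hypothetical helper for illustration. Only the differentiable
    # tensors must live on CUDA; grads, exp_avgs, etc. are assumed to
    # match the params in device and dtype by construction, and the
    # foreach kernels tolerate CPU step counters, so state_steps are
    # intentionally left out of the check.
    return all(p.is_cuda for p in params)

if torch.cuda.is_available():
    params = [torch.randn(4, device="cuda") for _ in range(2)]
    state_steps = [torch.zeros(()) for _ in range(2)]  # scalar step counts kept on CPU
    # Under the old "ALL tensors on CUDA" rule this returned False;
    # with the widened check it returns True.
    print(_should_default_to_foreach(params, state_steps))
```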

pytorch-bot bot commented Mar 2, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95862

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 3 Pending

As of commit 1bbb43e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Mar 2, 2023
@atalman atalman left a comment

LGTM

@atalman atalman merged commit 0865964 into pytorch:release/2.0 Mar 2, 2023
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request May 3, 2023
* [optim] include nn.Parameter as foreach supported (pytorch#95811)
* [optim] Widen the cases for defaulting to foreach (pytorch#95820)