
Conversation


@janeyx99 janeyx99 commented Mar 2, 2023

Big OOPS correction. Also added a test this time to verify the defaulting works as expected.

We've claimed (and intended) to default to foreach wherever we can, but today we realized we had missed two critical groups (the biggest ones, covering essentially all real use cases):

  1. Models. We forgot to count nn.Parameter among the natively supported tensor types, so foreach only defaulted for test cases and for people who didn't use models (a sketch of the type check follows this list).
  2. We previously required that ALL relevant tensors be on CUDA before flipping the switch. Almost all state_steps, however, live on CPU the whole time, so the great majority of real runs never flipped to foreach.
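To make the first point concrete, here is a minimal sketch of the kind of supported-type check involved. The names `_foreach_supported_types` and `_all_tensors_supported` are illustrative assumptions, not necessarily PyTorch's exact internals:

```python
import torch
import torch.nn as nn

# Illustrative sketch (names are assumptions, not necessarily PyTorch's
# exact internals). An exact type() check is used rather than isinstance()
# because arbitrary Tensor subclasses may override ops and should not be
# silently batched by the foreach kernels.
_foreach_supported_types = [torch.Tensor, nn.Parameter]  # the fix: include nn.Parameter

def _all_tensors_supported(tensors):
    return all(type(t) in _foreach_supported_types for t in tensors)

params = [nn.Parameter(torch.randn(2, 2)) for _ in range(3)]
# Before the fix the list held only torch.Tensor, so every model
# parameter failed this check and foreach never defaulted on.
print(_all_tensors_supported(params))  # True with nn.Parameter included
```

Since nn.Parameter is what `model.parameters()` yields, the exact-type check is why anyone optimizing a real model fell through to the slower single-tensor path.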


janeyx99 added 2 commits March 2, 2023 00:05
[optim] include nn.Parameter as foreach supported (pytorch#95811)

This PR is the result of realizing that models were NOT subscribed to the foreach defaulting, despite what our documentation has claimed for months now. BIG OOPS.

Pull Request resolved: pytorch#95811
Approved by: https://github.com/albanD
[optim] Widen the cases for defaulting to foreach (pytorch#95820)

Big OOPS correction, continued. Also added a test this time to verify the defaulting works as expected.

The key here is realizing that the grouping for foreach already assumes that the non-param tensorlists follow the params in dtype and device, so it was too narrow to check that _all_ tensors were on CUDA. The main leeway this allowed was for state_steps, which are sometimes CPU tensors. Since foreach _can_ handle CPU tensors, this should not introduce breakage (see the sketch below).

Pull Request resolved: pytorch#95820
Approved by: https://github.com/albanD
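A hedged sketch of the widened device check described above (the helper name and signature are hypothetical, not PyTorch's actual internals): only the params need to be on CUDA, and CPU state_steps no longer block the default.

```python
import torch

def _should_default_to_foreach(params, state_steps):
    # Hypothetical helper for illustration. Only the differentiable
    # tensors must live on CUDA; grads, exp_avgs, etc. are assumed to
    # match the params in device and dtype by construction, and the
    # foreach kernels tolerate CPU step counters, so state_steps are
    # intentionally left out of the check.
    return all(p.is_cuda for p in params)

if torch.cuda.is_available():
    params = [torch.randn(4, device="cuda") for _ in range(2)]
    state_steps = [torch.zeros(()) for _ in range(2)]  # scalar step counts kept on CPU
    # Under the old "ALL tensors on CUDA" rule this returned False;
    # with the widened check it returns True.
    print(_should_default_to_foreach(params, state_steps))
```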

pytorch-bot bot commented Mar 2, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95862

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 3 Pending

As of commit 1bbb43e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Mar 2, 2023
@atalman atalman left a comment

LGTM

@atalman atalman merged commit 0865964 into pytorch:release/2.0 Mar 2, 2023
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request May 3, 2023
* [optim] include nn.Parameter as foreach supported (pytorch#95811)
* [optim] Widen the cases for defaulting to foreach (pytorch#95820)