[FSDP][RFC] Enforce rank r's current device is cuda:r
#92035
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92035
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 Failure as of commit 3e82ae9. NEW FAILURES: the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
IIUC, FSDP assumes that rank `r` uses `cuda:r`. In other words, FSDP assumes single-process single-device (SPSD) and CUDA devices only.

The SPSD assumption can help us simplify the `compute_device` and `device_id` code, since specifying `device_id` then becomes simply a signal of whether the user wants FSDP to move parameters on their behalf. The actual target device is fixed (namely, `cuda:r`). Since the SPSD assumption is deeply embedded in FSDP, I think this is reasonable. We can generalize to non-CUDA devices in the future while still enforcing SPSD. Feel free to let me know your thoughts.

If we go with this, we can provide earlier and cleaner error handling in case the user forgets to set the current CUDA device. See #91661.
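For concreteness, here is a minimal sketch (not from this PR) of the SPSD setup described above: each rank sets its current device to `cuda:r` before constructing FSDP, so `device_id` only signals whether FSDP should move a CPU module on the user's behalf. The `torchrun`/`LOCAL_RANK` launch and the toy module are assumptions for illustration.

```python
# Minimal sketch of the SPSD setup this RFC assumes (not code from the PR).
# Assumes launch via torchrun, which sets LOCAL_RANK for each process.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # rank r's current device becomes cuda:r
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(8, 8)  # placeholder module, still on CPU here
    # With the current device already set, device_id is just a signal that FSDP
    # should move the CPU module to the fixed target device cuda:r for the user.
    fsdp_model = FSDP(model, device_id=torch.cuda.current_device())

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```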
I think the SPSD and CUDA-device-only assumption reflects the current FSDP state. "We can provide earlier and cleaner error handling in case the user forgets to set the current CUDA device" sounds great to me, thank you!
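To illustrate the kind of earlier, cleaner error handling being discussed, a check like the following could fail fast at FSDP construction time. This is a hypothetical sketch, not FSDP's actual implementation; the helper name is made up, and it assumes a single-node job where the global rank doubles as the local device index.

```python
# Hypothetical early check illustrating the proposal; not FSDP's actual code.
# Assumes a single-node job, so the global rank is also the local device index.
import torch
import torch.distributed as dist


def _check_spsd_current_device() -> None:
    rank = dist.get_rank()
    current = torch.cuda.current_device()
    if current != rank:
        raise RuntimeError(
            f"FSDP assumes rank {rank} uses cuda:{rank} (SPSD), but the current "
            f"CUDA device is cuda:{current}. Please call "
            f"torch.cuda.set_device({rank}) before constructing FSDP."
        )
```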
ghstack-source-id: df22a11
Pull Request resolved: pytorch#92035
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`.
Stack from ghstack:
- #92035 [FSDP][RFC] Enforce rank `r`'s current device is `cuda:r`
- #91767 [FSDP] Do not clean FQNs even for `use_orig_params=True`
- #92031 [FSDP][BE] Improve `device_id` + CPU offload test
- #92028 [FSDP][BE] Rename `prefixed_param_names` -> `fqns` for consolidation