
Conversation

@awgu (Collaborator) commented Jan 11, 2023

Stack from ghstack:

IIUC, FSDP assumes that rank `r` uses `cuda:r`. In other words, FSDP assumes single-process single-device (SPSD) and CUDA devices only.

The SPSD assumption can help us simplify the `compute_device` and `device_id` code, since specifying `device_id` then becomes simply a signal for whether the user wants FSDP to move parameters on their behalf or not. The actual target device is fixed (namely, `cuda:r`). Since the SPSD assumption is deeply embedded in FSDP, I think this is reasonable. We can generalize to non-CUDA devices in the future while still enforcing SPSD.

Feel free to let me know your thoughts.

If we go with this, we can provide earlier and cleaner error handling in the case where the user forgets to set the current CUDA device. See #91661.
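
To make the SPSD convention concrete, here is a minimal sketch (not part of this PR) of the setup it assumes: one process per GPU, each rank bound to `cuda:<rank>` before wrapping with FSDP. The `LOCAL_RANK` environment variable and the use of `torchrun` are assumptions for illustration.

```python
# Minimal sketch of the SPSD convention, assuming launch via torchrun
# (which sets LOCAL_RANK) and one process per GPU. Not from this PR.
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # the step users sometimes forget
    dist.init_process_group("nccl")

    model = nn.Linear(8, 8)
    # Under the SPSD assumption, device_id is only a signal that FSDP should
    # move the module to cuda:<rank> on the user's behalf; the target device
    # is fixed (cuda:<rank>) either way.
    fsdp_model = FSDP(model, device_id=torch.cuda.current_device())
    print(f"rank {dist.get_rank()} compute device: cuda:{local_rank}")


if __name__ == "__main__":
    main()
```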

@pytorch-bot (bot) commented Jan 11, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92035

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failures

As of commit 3e82ae9:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@zhaojuanmao (Contributor) commented:

I think the SPSD and CUDA-device-only assumption reflects the current FSDP state. "We can provide earlier and cleaner error handling in the case where the user forgets to set the current CUDA device" sounds great to me, thank you!!
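
For illustration, the kind of early check being discussed might look like the sketch below. The helper name `_check_spsd_current_device` is hypothetical, not FSDP's actual API; the actual error handling landed separately (see #91661).

```python
# Hypothetical sketch of an earlier, cleaner SPSD check (illustrative only;
# not the actual code added to FSDP). Under SPSD, rank r should have
# cuda:<r % num_gpus> as its current device, so a mismatch at wrap time
# almost always means the user forgot torch.cuda.set_device().
import torch
import torch.distributed as dist


def _check_spsd_current_device() -> None:  # hypothetical helper
    expected = dist.get_rank() % torch.cuda.device_count()
    current = torch.cuda.current_device()
    if current != expected:
        raise RuntimeError(
            f"FSDP assumes single-process single-device: rank {dist.get_rank()} "
            f"should use cuda:{expected}, but the current CUDA device is "
            f"cuda:{current}. Did you forget to call torch.cuda.set_device()?"
        )
```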

awgu pushed a commit that referenced this pull request Jan 12, 2023
ghstack-source-id: fb32c26
Pull Request resolved: #92035
awgu pushed a commit that referenced this pull request Jan 12, 2023
ghstack-source-id: ac5de49
Pull Request resolved: #92035
awgu pushed a commit that referenced this pull request Jan 14, 2023
ghstack-source-id: df22a11
Pull Request resolved: #92035
awgu pushed a commit to awgu/pytorch that referenced this pull request Jan 19, 2023
@github-actions (bot) commented:

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Mar 15, 2023
@github-actions github-actions bot closed this Apr 14, 2023
@facebook-github-bot facebook-github-bot deleted the gh/awgu/294/head branch June 8, 2023 15:33