Support XPU in --nproc-per-node option to torchrun#159474

Closed
moksiuc wants to merge 22 commits into pytorch:main from moksiuc:moksiucik_torchrun_xpu

@moksiuc
Contributor

@moksiuc moksiuc commented Jul 30, 2025

Support both --nproc-per-node=xpu and autodetection of XPU device in case of --nproc-per-node=auto

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci
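
To make the described behavior concrete, here is a minimal sketch of how an `--nproc-per-node` style value could be turned into a process count. It is not the code from this PR: `resolve_nproc_per_node` and the `device_counts` mapping are hypothetical names, and the mapping stands in for runtime probes such as `torch.cuda.device_count()` or `torch.xpu.device_count()` so the example runs without PyTorch installed.

```python
import os
from typing import Mapping, Optional


def resolve_nproc_per_node(
    value: str,
    device_counts: Mapping[str, int],
    cpu_count: Optional[int] = None,
) -> int:
    """Translate an --nproc-per-node style value into a process count.

    `device_counts` is a hypothetical stand-in for runtime device probes;
    it maps an accelerator name ("cuda", "xpu") to the number of devices.
    """
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)

    if value.isdigit():
        return int(value)  # explicit count, e.g. --nproc-per-node=4
    if value == "cpu":
        return cpus
    if value in ("gpu", "xpu"):
        # "gpu" is assumed to mean CUDA devices in this sketch.
        key = "cuda" if value == "gpu" else "xpu"
        count = device_counts.get(key, 0)
        if count == 0:
            raise ValueError(f"{key} is not available")
        return count
    if value == "auto":
        # Use the first accelerator that reports devices, else fall back to CPU.
        for count in device_counts.values():
            if count > 0:
                return count
        return cpus
    raise ValueError(f"Unsupported nproc_per_node value: {value}")
```

With this sketch, both `resolve_nproc_per_node("xpu", {"cuda": 0, "xpu": 2})` and `resolve_nproc_per_node("auto", {"cuda": 0, "xpu": 2})` pick the two XPU devices, while `"auto"` with no accelerators falls back to the CPU count.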

@pytorch-bot

pytorch-bot bot commented Jul 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159474

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit eef7e98 with merge base be8095b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jul 30, 2025
@moksiuc
Contributor Author

moksiuc commented Jul 30, 2025

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jul 30, 2025
@albanD albanD requested a review from d4l3k July 30, 2025 13:33
@albanD albanD added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 30, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 31, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Jul 31, 2025
@guangyey guangyey added ciflow/xpu Run XPU CI tasks ciflow/trunk Trigger trunk jobs on your pull request labels Jul 31, 2025
@pytorch-bot

pytorch-bot bot commented Jul 31, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Jul 31, 2025
@guangyey guangyey added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 31, 2025
@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels Jul 31, 2025
@moksiuc
Contributor Author

moksiuc commented Jul 31, 2025

I had to make an additional modification because of a lint error:

Lint for torch/distributed/run.py:

Error (MYPY) [union-attr]
Item "None" of "device | None" has no attribute "type"

    709  |        elif nproc_per_node == "auto":
    710  |            if torch.accelerator.is_available():
    711  |                num_proc = torch.accelerator.device_count()
>>> 712  |                device_type = torch.accelerator.current_accelerator().type
    713  |            else:
    714  |                num_proc = os.cpu_count()
    715  |                device_type = "cpu"
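
For context, mypy reports this because the accelerator lookup is typed as returning `device | None`, so `.type` cannot be read without a `None` check. A common way to resolve a `union-attr` error like this is to bind the result to a local variable and branch on it. The sketch below illustrates that pattern only; it is not necessarily the change made in this PR. `Device` and `auto_detect` are hypothetical stand-ins so the example runs without PyTorch.

```python
import os
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Device:
    """Hypothetical stand-in for torch.device; only `.type` matters here."""

    type: str


def auto_detect(
    current_accelerator: Callable[[], Optional[Device]],
    device_count: Callable[[], int],
) -> Tuple[int, str]:
    """Return (num_proc, device_type) for the "auto" case."""
    acc = current_accelerator()  # may be None, which is what mypy flags
    if acc is not None:
        # Inside this branch mypy narrows `acc` from Optional[Device] to Device,
        # so reading `.type` no longer triggers union-attr.
        return device_count(), acc.type
    # No accelerator found: fall back to one process per CPU core.
    return os.cpu_count() or 1, "cpu"
```

Calling `auto_detect(lambda: Device("xpu"), lambda: 4)` yields `(4, "xpu")`, and passing `lambda: None` as the first argument takes the CPU fallback.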

guangyey previously approved these changes Aug 1, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Aug 1, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Aug 11, 2025
@moksiuc moksiuc requested a review from d4l3k August 12, 2025 08:47
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Aug 13, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Sep 1, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Sep 2, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Sep 2, 2025
@moksiuc moksiuc requested a review from EikanWang September 10, 2025 08:08
@moksiuc
Contributor Author

moksiuc commented Sep 10, 2025

@gujinghui

@moksiuc
Contributor Author

moksiuc commented Sep 11, 2025

@guangyey, what else needs to be done for this PR to be merged?

@guangyey
Collaborator

Hey @moksiuc, since @d4l3k is quite busy, I’ll remind him to take another look at this soon.

Member

@d4l3k d4l3k left a comment

LGTM

@guangyey
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 12, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Support both --nproc-per-node=xpu and autodetection of XPU device in case of --nproc-per-node=auto

Pull Request resolved: pytorch#159474
Approved by: https://github.com/tsocha, https://github.com/guangyey, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
@moksiuc moksiuc deleted the moksiucik_torchrun_xpu branch December 10, 2025 08:45
Labels

ciflow/trunk - Trigger trunk jobs on your pull request
Merged
oncall: distributed - Add this issue/PR to distributed oncall triage queue
open source
topic: not user facing - topic category
triaged - This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

Archived in project

8 participants