@wanchaol wanchaol commented May 3, 2024

looks like we can make it work :)

pytorch-bot bot commented May 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125475

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit ed26b4f with merge base 0199ce8:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ci-td-distributed, ciflow/inductor, and oncall: distributed labels May 3, 2024

@awgu awgu left a comment

Amazing!

looks like we can make it work :)

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

@wanchaol wanchaol added the release notes: distributed (dtensor) label May 3, 2024
wanchaol added 2 commits May 3, 2024 12:56
wanchaol commented May 3, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label May 3, 2024
@pytorchmergebot
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

self.assertEqual(comm_counts[funcol.all_reduce], 1)
# FSDP comms
self.assertEqual(comm_counts[c10d_ops._allgather_base_], 1)
self.assertEqual(comm_counts[c10d_ops._reduce_scatter_base_], 1)

@wanchaol (Collaborator, Author) commented:

Now we can capture FSDP comms (this triggered some failures in CI, but it's nice to see!)
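
For readers unfamiliar with the pattern in the assertions above, here is a minimal pure-Python sketch of the idea: a context manager that counts calls to a registered set of collective ops and ignores everything else. It does not use PyTorch at all. `CountingMode`, `record`, `simulated_step`, and the string op names are hypothetical stand-ins chosen for illustration, not the real `CommDebugMode` API.

```python
from collections import defaultdict


class CountingMode:
    """Hypothetical stand-in for a comm-counting debug mode (not the PyTorch API)."""

    def __init__(self, tracked_ops):
        self.tracked_ops = set(tracked_ops)
        self.comm_counts = defaultdict(int)
        self._active = False

    def __enter__(self):
        # Start from a clean slate each time the mode is entered.
        self.comm_counts.clear()
        self._active = True
        return self

    def __exit__(self, exc_type, exc, tb):
        self._active = False
        return False  # do not swallow exceptions

    def record(self, op_name):
        # Only ops registered as collectives are counted.
        if self._active and op_name in self.tracked_ops:
            self.comm_counts[op_name] += 1

    def get_total_counts(self):
        return sum(self.comm_counts.values())


def simulated_step(mode):
    # Pretend training step: three collectives plus one non-collective op.
    mode.record("all_reduce")
    mode.record("_allgather_base_")
    mode.record("_reduce_scatter_base_")
    mode.record("aten.mm")  # not tracked, so it is ignored


tracked = ["all_reduce", "_allgather_base_", "_reduce_scatter_base_"]
with CountingMode(tracked) as mode:
    simulated_step(mode)

comm_counts = mode.comm_counts
```

In this sketch an op only shows up in `comm_counts` if it was registered up front, which is why collectives that are not yet registered would go uncounted.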

wanchaol commented May 3, 2024

@pytorchbot merge

@pytorchmergebot
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot pushed a commit that referenced this pull request May 4, 2024
pytorchmergebot pushed a commit that referenced this pull request May 6, 2024
XilunWu commented May 13, 2024

Hi Wanchao. I'm planning to add all c10d collective ops to CommDebugMode. WDYT? (Is it worth doing? How difficult would it be?) cc @awgu

@wanchaol
> Hi Wanchao. I'm planning to add all c10d collective ops to CommDebugMode. WDYT? (Is it worth doing? How difficult would it be?) cc @awgu

@XilunWu yeah, feel free to do it :) I think it's not that difficult; one just needs to add the ops and add tests.
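
As a rough illustration of "add the ops and add tests", the sketch below widens a set of tracked op names and checks that the newly added ops start being counted. It is plain Python over a recorded list of op names, not PyTorch code. `BASE_TRACKED_OPS`, `EXTRA_OPS`, `count_comms`, and the contents of `trace` are all hypothetical, and the extra op names are illustrative assumptions rather than a confirmed list of c10d collectives.

```python
from collections import Counter

# Ops the hypothetical mode already tracks (names taken from the test above).
BASE_TRACKED_OPS = frozenset({"_allgather_base_", "_reduce_scatter_base_"})

# Illustrative additions; which ops really belong here is an assumption.
EXTRA_OPS = frozenset({"allreduce_", "allgather_", "broadcast_"})


def count_comms(trace, tracked_ops):
    """Count how often each tracked op appears in a recorded list of op names."""
    return Counter(op for op in trace if op in tracked_ops)


# A made-up trace of op names, mixing collectives with a non-collective op.
trace = ["allreduce_", "aten.add", "_allgather_base_", "broadcast_", "allreduce_"]

before = count_comms(trace, BASE_TRACKED_OPS)
after = count_comms(trace, BASE_TRACKED_OPS | EXTRA_OPS)
```

The test side of the change would then assert on the new counts, in the same style as the `comm_counts[...]` assertions in the diff.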

@github-actions github-actions bot deleted the gh/wanchaol/460/head branch June 14, 2024 01:53
Labels

ci-td-distributed, ciflow/inductor, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (dtensor)
