-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[redo] Fix SyncBatchNorm forward pass for non-default process group #43861
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## gh/vkuzo/134/base #43861 +/- ##
==================================================
Coverage 69.31% 69.32%
==================================================
Files 378 378
Lines 46745 46745
==================================================
+ Hits 32403 32404 +1
+ Misses 14342 14341 -1
Continue to review full report at Codecov.
|
💊 CI failures summary and remediationsAs of commit bc60649 (more details on the Dr. CI page):
Extra GitHub checks: 1 failed
codecov.io: 1 failed
This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group. This comment has been revised 1 time. |
ngimel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be good to add test?
@ngimel would you be ok with a manual test for this instance (100% ok to say yes)? There are some issues with my devgpu which are preventing me from running the OSS distributed test suite properly, so just hoping to plan whether I need to block a few hours to fix my env for this PR. |
|
Yeah, manual test is ok, our OSS CI probably just has 2 gpus and won't allow you to create non-default process group anyway. |
great, thanks, added manual test to test plan |
|
This pull request has been merged in b167402. |
Stack from ghstack:
Summary:
This is a redo of #38874, and
fixing my original bug from
#38246.
Test Plan:
run DDP + SyncBN on two process groups:
https://gist.github.com/vkuzo/8501121d0d25261b70110aae4465ae0a
the script fails before this PR and works after this PR
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D23418816