Skip to content

Conversation

@vkuzo
Copy link
Contributor

@vkuzo vkuzo commented Aug 30, 2020

Stack from ghstack:

Summary:

This is a redo of #38874, and
fixing my original bug from
#38246.

Test Plan:

run DDP + SyncBN on two process groups:
https://gist.github.com/vkuzo/8501121d0d25261b70110aae4465ae0a
the script fails before this PR and works after this PR

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D23418816

Summary:

This is a redo of #38874, and
fixing my original bug from
#38246.

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@vkuzo vkuzo requested a review from apaszke as a code owner August 30, 2020 16:25
vkuzo added a commit that referenced this pull request Aug 30, 2020
Summary:

This is a redo of #38874, and
fixing my original bug from
#38246.

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: fb76ed7
Pull Request resolved: #43861
@vkuzo vkuzo changed the title Fix SyncBatchNorm forward pass for non-default process group [redo] Fix SyncBatchNorm forward pass for non-default process group Aug 30, 2020
@vkuzo vkuzo requested a review from ngimel August 30, 2020 16:30
@codecov
Copy link

codecov bot commented Aug 30, 2020

Codecov Report

Merging #43861 into gh/vkuzo/134/base will increase coverage by 0.00%.
The diff coverage is 0.00%.

Impacted file tree graph

@@                Coverage Diff                 @@
##           gh/vkuzo/134/base   #43861   +/-   ##
==================================================
  Coverage              69.31%   69.32%           
==================================================
  Files                    378      378           
  Lines                  46745    46745           
==================================================
+ Hits                   32403    32404    +1     
+ Misses                 14342    14341    -1     
Impacted Files Coverage Δ
torch/nn/modules/_functions.py 63.30% <0.00%> (ø)
torch/testing/_internal/expecttest.py 78.57% <0.00%> (+1.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a41fa4...bc60649. Read the comment docs.

@dr-ci
Copy link

dr-ci bot commented Aug 30, 2020

💊 CI failures summary and remediations

As of commit bc60649 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed


codecov.io: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 1 time.

Copy link
Collaborator

@ngimel ngimel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to add test?

@vkuzo
Copy link
Contributor Author

vkuzo commented Sep 2, 2020

would be good to add test?

@ngimel would you be ok with a manual test for this instance (100% ok to say yes)? There are some issues with my devgpu which are preventing me from running the OSS distributed test suite properly, so just hoping to plan whether I need to block a few hours to fix my env for this PR.

@ngimel
Copy link
Collaborator

ngimel commented Sep 2, 2020

Yeah, manual test is ok, our OSS CI probably just has 2 gpus and won't allow you to create non-default process group anyway.

@vkuzo
Copy link
Contributor Author

vkuzo commented Sep 2, 2020

Yeah, manual test is ok, our OSS CI probably just has 2 gpus and won't allow you to create non-default process group anyway.

great, thanks, added manual test to test plan

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in b167402.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants