Add a warning log when there is high skew of uneven inputs in DDP training #45238
Conversation
…ining Adds a warning when there is a much higher than expected discrepancy in the number of inputs across different processes when running with uneven inputs. A skew in the thousands of inputs can reduce performance by a nontrivial amount, as shown in benchmarks, so it was proposed to add this warning. Tested by running the tests so that the threshold is hit and observing the output. Differential Revision: [D23719270](https://our.internmc.facebook.com/intern/diff/D23719270/) [ghstack-poisoned]
… in DDP training" This request came up in feature review for DDP uneven inputs, so this PR adds a warning when there is a much higher than expected discrepancy in the number of inputs across different processes when running with uneven inputs. A skew in the thousands of inputs can reduce performance by a nontrivial amount, as shown in benchmarks, so it was proposed to add this warning. Tested by running the tests so that the threshold is hit and observing the output. Differential Revision: [D23719270](https://our.internmc.facebook.com/intern/diff/D23719270/) [ghstack-poisoned]
… in DDP training" This request came up in feature review for DDP uneven inputs, so this PR adds a warning when there is a much higher than expected discrepancy in the number of inputs across different processes when running with uneven inputs. A skew in the thousands of inputs can reduce performance by a nontrivial amount, as shown in the benchmarks in #42577, so it was proposed to add this warning. Tested by running the tests so that the threshold is hit and observing the output. Differential Revision: [D23719270](https://our.internmc.facebook.com/intern/diff/D23719270/) [ghstack-poisoned]
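For context, "uneven inputs" here refers to the `DistributedDataParallel.join()` context manager added in #42577, which lets ranks with different numbers of batches finish training without hanging on collective calls. A minimal usage sketch follows (process-group initialization and real data loading are assumed to happen elsewhere; the warning added by this PR would fire inside such a loop once one rank runs far past another):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

def train_uneven(rank, model, batches):
    # Assumes torch.distributed.init_process_group() has already been called.
    ddp_model = DDP(model.to(rank), device_ids=[rank])
    # `batches` may contain a different number of elements on each rank.
    with ddp_model.join():
        for inp in batches:
            out = ddp_model(inp)
            out.sum().backward()
```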
💊 CI failures summary and remediations
As of commit fb43098 (more details on the Dr. CI page):
Extra GitHub checks: 1 failed
codecov.io: 1 failed
mrshenli
left a comment
LGTM!
torch/nn/parallel/distributed.py
Outdated
| "other currently active ranks. This level of skew could " | ||
| "lead to performance degradation during training." | ||
| ) | ||
| warned = True |
Please feel free to ignore. This can also be done using warnings.simplefilter("once") IIUC.
Thanks! This is better than toggling the boolean.
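For illustration, the two approaches discussed above look roughly like this (a sketch only; the exact variable names and warning text in distributed.py may differ):

```python
import warnings

_warned = False  # module-level flag, as in the original version of this PR

def warn_with_flag(msg):
    # Option 1: gate the warning manually so it is emitted at most once.
    global _warned
    if not _warned:
        warnings.warn(msg)
        _warned = True

def warn_with_filter(msg):
    # Option 2 (reviewer suggestion): the "once" action tells the warnings
    # machinery to print only the first occurrence of a matching warning,
    # so no extra bookkeeping is needed.
    warnings.simplefilter("once")
    warnings.warn(msg)
```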
…ining Pull Request resolved: #45238 Adds a warning when there is a much higher than expected discrepancy in the number of inputs across different processes when running with uneven inputs. A skew in the thousands of inputs can reduce performance by a nontrivial amount, as shown in benchmarks, so it was proposed to add this warning. Tested by running the tests so that the threshold is hit and observing the output. ghstack-source-id: 112773552 Differential Revision: [D23719270](https://our.internmc.facebook.com/intern/diff/D23719270/)
Codecov Report
@@ Coverage Diff @@
## gh/rohan-varma/179/base #45238 +/- ##
===========================================================
- Coverage 68.01% 68.00% -0.01%
===========================================================
Files 393 393
Lines 50847 50854 +7
===========================================================
+ Hits 34583 34584 +1
- Misses 16264 16270 +6
Continue to review full report at Codecov.
This pull request has been merged in e57a081.
Stack from ghstack:
This request came up in feature review for DDP uneven inputs, so this PR adds a warning when there is a much higher than expected discrepancy in the number of inputs across different processes when running with uneven inputs. A skew in the thousands of inputs can reduce performance by a nontrivial amount, as shown in the benchmarks in #42577, so it was proposed to add this warning. Tested by running the tests so that the threshold is hit and observing the output.
Differential Revision: D23719270
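As a rough sketch of the mechanism described above (the threshold value, function name, and parameters here are assumptions for illustration, not the exact code that landed), the check compares how many iterations the local rank has run past the first rank that exhausted its input and warns once the gap exceeds a fixed threshold:

```python
import warnings

# Hypothetical threshold; the value used in distributed.py may differ.
SKEW_WARNING_THRESHOLD = 1000

def check_uneven_input_skew(local_iterations, first_exhausted_iteration):
    """Warn once if this rank has processed far more inputs than the first
    rank that ran out of data inside the join() context."""
    if first_exhausted_iteration is None:
        # No rank has exhausted its input yet, so there is nothing to check.
        return
    skew = local_iterations - first_exhausted_iteration
    if skew >= SKEW_WARNING_THRESHOLD:
        warnings.simplefilter("once")  # emit at most one such warning
        warnings.warn(
            f"Detected uneven input skew of greater than {SKEW_WARNING_THRESHOLD}: "
            "this rank has processed many more inputs than "
            "other currently active ranks. This level of skew could "
            "lead to performance degradation during training."
        )
```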