Add a warning log when there is high skew of uneven inputs in DDP training #45238
Conversation
…ining Adds a warning when there is a much higher than expected discrepancy in the number of inputs across different processes when running with uneven inputs. A skew in the thousands of inputs can reduce performance by a nontrivial amount, as shown in benchmarks, so it was proposed to add this warning. Tested by running the tests so that the threshold is hit and observing the output. Differential Revision: [D23719270](https://our.internmc.facebook.com/intern/diff/D23719270/) [ghstack-poisoned]
… in DDP training" This request came up in feature review for DDP uneven inputs, so this PR adds a warning when there is a much higher than expected discrepancy in the number of inputs across different processes when running with uneven inputs. A skew in the thousands of inputs can reduce performance by a nontrivial amount, as shown in benchmarks, so it was proposed to add this warning. Tested by running the tests so that the threshold is hit and observing the output. Differential Revision: [D23719270](https://our.internmc.facebook.com/intern/diff/D23719270/) [ghstack-poisoned]
… in DDP training" This request came up in feature review for DDP uneven inputs, so this PR adds a warning when there is a much higher than expected discrepancy in the number of inputs across different processes when running with uneven inputs. A skew in the thousands of inputs can reduce performance by a nontrivial amount, as shown in the benchmarks in #42577, so it was proposed to add this warning. Tested by running the tests so that the threshold is hit and observing the output. Differential Revision: [D23719270](https://our.internmc.facebook.com/intern/diff/D23719270/) [ghstack-poisoned]
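For context, "uneven inputs" here refers to the `DistributedDataParallel.join()` context manager added in #42577, which lets ranks with different numbers of batches finish training without hanging on collective calls. A minimal usage sketch follows (process-group initialization and real data loading are assumed to happen elsewhere; the warning added by this PR would fire inside such a loop once one rank runs far past another):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

def train_uneven(rank, model, batches):
    # Assumes torch.distributed.init_process_group() has already been called.
    ddp_model = DDP(model.to(rank), device_ids=[rank])
    # `batches` may contain a different number of elements on each rank.
    with ddp_model.join():
        for inp in batches:
            out = ddp_model(inp)
            out.sum().backward()
```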
💊 CI failures summary and remediations
As of commit fb43098 (more details on the Dr. CI page):
Extra GitHub checks: 1 failed
codecov.io: 1 failed
mrshenli
left a comment
LGTM!
torch/nn/parallel/distributed.py
Outdated
| "other currently active ranks. This level of skew could " | ||
| "lead to performance degradation during training." | ||
| ) | ||
| warned = True |
Please feel free to ignore. This can also be done using warnings.simplefilter("once") IIUC.
Thanks! This is better than toggling the boolean.
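For illustration, the two approaches discussed above look roughly like this (a sketch only; the exact variable names and warning text in distributed.py may differ):

```python
import warnings

_warned = False  # module-level flag, as in the original version of this PR

def warn_with_flag(msg):
    # Option 1: gate the warning manually so it is emitted at most once.
    global _warned
    if not _warned:
        warnings.warn(msg)
        _warned = True

def warn_with_filter(msg):
    # Option 2 (reviewer suggestion): the "once" action tells the warnings
    # machinery to print only the first occurrence of a matching warning,
    # so no extra bookkeeping is needed.
    warnings.simplefilter("once")
    warnings.warn(msg)
```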
…ining Pull Request resolved: #45238 Adds a warning when there is a much higher than expected discrepancy in the number of inputs across different processes when running with uneven inputs. A skew in the thousands of inputs can reduce performance by a nontrivial amount, as shown in benchmarks, so it was proposed to add this warning. Tested by running the tests so that the threshold is hit and observing the output. ghstack-source-id: 112773552 Differential Revision: [D23719270](https://our.internmc.facebook.com/intern/diff/D23719270/)
Codecov Report
@@ Coverage Diff @@
## gh/rohan-varma/179/base #45238 +/- ##
===========================================================
- Coverage 68.01% 68.00% -0.01%
===========================================================
Files 393 393
Lines 50847 50854 +7
===========================================================
+ Hits 34583 34584 +1
- Misses 16264 16270 +6
Continue to review full report at Codecov.
This pull request has been merged in e57a081.
Stack from ghstack:
This request came up in feature review for DDP uneven inputs, so this PR adds a warning when there is a much higher than expected discrepancy in the number of inputs across different processes when running with uneven inputs. A skew in the thousands of inputs can reduce performance by a nontrivial amount, as shown in the benchmarks in #42577, so it was proposed to add this warning. Tested by running the tests so that the threshold is hit and observing the output.
Differential Revision: D23719270
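As a rough sketch of the mechanism described above (the threshold value, function name, and parameters here are assumptions for illustration, not the exact code that landed), the check compares how many iterations the local rank has run past the first rank that exhausted its input and warns once the gap exceeds a fixed threshold:

```python
import warnings

# Hypothetical threshold; the value used in distributed.py may differ.
SKEW_WARNING_THRESHOLD = 1000

def check_uneven_input_skew(local_iterations, first_exhausted_iteration):
    """Warn once if this rank has processed far more inputs than the first
    rank that ran out of data inside the join() context."""
    if first_exhausted_iteration is None:
        # No rank has exhausted its input yet, so there is nothing to check.
        return
    skew = local_iterations - first_exhausted_iteration
    if skew >= SKEW_WARNING_THRESHOLD:
        warnings.simplefilter("once")  # emit at most one such warning
        warnings.warn(
            f"Detected uneven input skew of greater than {SKEW_WARNING_THRESHOLD}: "
            "this rank has processed many more inputs than "
            "other currently active ranks. This level of skew could "
            "lead to performance degradation during training."
        )
```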