Support sparse gradients in DistributedDataParallel #19443
Conversation
Differential Revision: D15007365 Differential Version: 80026012
mrshenli left a comment:
Shall we add some tests?
Differential Revision: D15007365 Differential Version: 85199090
Differential Revision: D15007365 Differential Version: 85203697
Differential Revision: D15007365 Differential Version: 85204462
@mrshenli Added a test case that confirms numerical equivalence between the unwrapped and the DDP-wrapped module.
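For readers curious what such a parity check looks like, here is a minimal, self-contained sketch (hypothetical: a single-process gloo "world" and made-up shapes, not the actual test added in this PR):

```python
import copy
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical single-process setup so the script runs standalone.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("gloo", rank=0, world_size=1)

torch.manual_seed(0)
model = torch.nn.Embedding(10, 4, sparse=True)  # dense weight, sparse gradient
ddp_model = DDP(copy.deepcopy(model))

indices = torch.randint(0, 10, (8,))
model(indices).sum().backward()
ddp_model(indices).sum().backward()

# Both gradients are sparse COO tensors; densify before comparing.
assert torch.allclose(
    model.weight.grad.to_dense(),
    ddp_model.module.weight.grad.to_dense(),
)
dist.destroy_process_group()
```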
@pytorchbot retest this please
This pull request has been merged in 365de7b.
This diff stack broke ~all tests on master. Sample failure: https://circleci.com/gh/pytorch/pytorch/2038576?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link/console
Hello, I wonder whether this change is included in these .whl files? torch_download
I got torch by
I also cannot set sparse=True in nn.Embedding under DDP on 1.9.1+cu102.
Stack:
● #19443 Support sparse gradients in DistributedDataParallel 💛
○ #19146 Add sparse tensor allreduce 💛
This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out-of-band signal
is needed to indicate whether a dense parameter (e.g. an embedding)
is expected to receive a sparse gradient. This information is passed
to the bucket assignment computation routine and the reducer as a
vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors (see the sketch below).
Differential Revision: D15007365
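To illustrate the bucketing rule described above, here is a hedged Python sketch; the function name and structure are assumptions that stand in for the real C++ bucket assignment routine, with the cap defaulting to 25 MiB to mirror DDP's default bucket_cap_mb:

```python
from typing import List

def compute_bucket_assignment(
    tensor_sizes: List[int],           # per-parameter sizes in bytes
    expect_sparse_gradient: List[bool],
    bucket_cap: int = 25 * 1024 * 1024,
) -> List[List[int]]:
    """Toy re-creation of the idea in the PR description: dense
    parameters are grouped into size-capped buckets, while every
    parameter expecting a sparse gradient gets a bucket of its own."""
    buckets: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for i, (size, sparse) in enumerate(zip(tensor_sizes, expect_sparse_gradient)):
        if sparse:
            # Sparse gradients cannot be flattened with other tensors.
            buckets.append([i])
            continue
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

if __name__ == "__main__":
    sizes = [400, 400, 1_000_000]   # hypothetical parameter sizes
    sparse = [False, True, False]   # parameter 1 expects a sparse grad
    print(compute_bucket_assignment(sizes, sparse))
    # [[1], [0, 2]] -- the sparse parameter sits alone in its own bucket
```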