Support work.result() to get result tensors for allreduce for Gloo, NCCL backends (resubmission of #43386) #43970
Conversation
…rs for allreduce for Gloo, NCCL backends" Original commit changeset: 27fbeb161706 Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/) [ghstack-poisoned]
💊 CI failures summary and remediations (as of commit faac5ee; more details on the Dr. CI page):
3 jobs timed out:
ci.pytorch.org: 1 failed
…esult tensors for allreduce for Gloo, NCCL backends"" Original commit changeset: 27fbeb161706 Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/) [ghstack-poisoned]
…rs for allreduce for Gloo, NCCL backends" Pull Request resolved: #43970 Original commit changeset: 27fbeb161706 ghstack-source-id: 111190323 Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)
We can close out this PR since the original diff has been reverted?
pritamdamania87
left a comment
If this is a resubmission of #43386, can we update the PR title and summary to reflect this appropriately?
test/distributed/test_distributed.py (Outdated)

    result = work.result()
    expected_value = 2 + (10 * (len(group) - 1))
    self.assertEqual(result, [_build_tensor(src + 1, expected_value)])
nit: Since allreduce is in place we can just check if result is equal to tensor?
Being equal to the tensor does not mean the result is ready; both of them could still hold non-finalized values. I see that is covered by other unit tests, but it is still better not to depend on them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are calling work.wait() before the check though right? That should synchronize the appropriate streams to ensure this is safe to do.
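The wait-then-result contract discussed above can be modeled in a few lines of plain Python. This is a toy, single-process sketch: the `Work` and `all_reduce_async` names here are illustrative stand-ins, not the real torch.distributed API, and the arithmetic mirrors the unit test's `expected_value = 2 + (10 * (len(group) - 1))` check (source rank holds 2, every other rank holds 10).

```python
# Toy model of an async collective "Work" handle: wait() completes the op,
# result() returns the output tensors only after completion (Gloo semantics).
class Work:
    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._out = None

    def wait(self):
        # The real backends block (or sync CUDA streams) here; the toy
        # version just runs the reduction eagerly.
        self._out = self._compute()
        self._done = True

    def result(self):
        # Gloo asserts completion before handing back the output tensors.
        assert self._done, "call wait() before result()"
        return self._out


def all_reduce_async(rank_values):
    # Sum-allreduce over all "ranks": every rank ends up with the same total.
    total = sum(rank_values)
    return Work(lambda: [total for _ in rank_values])


# Source rank contributes 2, every other rank contributes 10, so the
# reduced value is 2 + 10 * (world_size - 1) on every rank.
world_size = 4
values = [2] + [10] * (world_size - 1)
work = all_reduce_async(values)
work.wait()
result = work.result()
assert all(v == 2 + 10 * (world_size - 1) for v in result)
```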
    std::vector<at::Tensor> outputs_;
    };

    class AsyncAllreduceCoalescedWork : public AsyncAllreduceWork {
Apart from allreduce, there are other ops like barrier, scatter, gather, reduce, etc. Can we add this functionality to those as well? This can be in a separate PR on top of this one.
Let's do it in a separate PR.
torch/lib/c10d/ProcessGroupNCCL.cpp (Outdated)

    bool result = exception() || finishedGPUExecutionInternal();
    return result;
Why not have this in a single line as before?
It was a temporary change to add some log lines. Reverted.
torch/lib/c10d/ProcessGroupNCCL.cpp (Outdated)

    }

    std::vector<at::Tensor> ProcessGroupNCCL::WorkNCCL::result() {
      TORCH_CHECK(isCompleted());
This is not necessarily true for NCCL. isCompleted() returns true only when the work has completed successfully on the GPU. But we can still retrieve the result for the Work and perform async processing. I think we shouldn't have this check for NCCL.
Good point, removed.
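The distinction raised above can be illustrated with a small mock. The `FakeWorkNCCL` class below is hypothetical (not the real `WorkNCCL`): it models how `isCompleted()` tracks GPU kernel completion, while the output tensor references already exist at enqueue time, so `result()` need not gate on completion for NCCL.

```python
# Hypothetical model of the NCCL Work semantics discussed above: result()
# hands back the output tensors without requiring GPU completion, enabling
# further stream-ordered async processing on them.
class FakeWorkNCCL:
    def __init__(self, outputs):
        self._outputs = outputs   # tensor references, known at enqueue time
        self._gpu_done = False    # flipped when the GPU kernel finishes

    def isCompleted(self):
        # True only once the work has finished executing on the GPU.
        return self._gpu_done

    def result(self):
        # No TORCH_CHECK(isCompleted()) here: callers may retrieve the
        # tensors before the kernel finishes and chain stream-ordered work.
        return self._outputs

    def finish_on_gpu(self):
        self._gpu_done = True


w = FakeWorkNCCL(outputs=["tensor0"])
assert not w.isCompleted()
assert w.result() == ["tensor0"]  # retrievable before GPU completion
w.finish_on_gpu()
assert w.isCompleted()
```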
Verified that the failing test also passed: https://app.circleci.com/pipelines/github/pytorch/pytorch/210229/workflows/86bde47b-f2da-48e3-a618-566ae2713102/jobs/7253683
…CCL backends Pull Request resolved: #43970 It is a resubmission of #43386 Original commit changeset: 27fbeb161706 ghstack-source-id: 111494058 Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)
pritamdamania87
left a comment
Looks good to me! Can we also update the PR summary appropriately? It currently still mentions "#43970 Back out "Revert D23216393: Support work.result() to get result tensors for allreduce for Gloo, NCCL backends"". Also, please check that all the builds are clean before landing.
…for Gloo, NCCL backends (resubmission of #43386)" Original commit changeset: 27fbeb161706 Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/) [ghstack-poisoned]
I just rebased with the latest changes in trunk. I am going to submit this as soon as testing finishes.
This pull request has been merged in 2e744b1.
…CCL backends (#43970) Summary: Pull Request resolved: #43970 It is a resubmission of #43386 Original commit changeset: 27fbeb161706 ghstack-source-id: 111775070 Test Plan: Added checks to existing unit test and ran it on gpu devserver. Verified the test that was failing in original diff also passes: https://app.circleci.com/pipelines/github/pytorch/pytorch/210229/workflows/86bde47b-f2da-48e3-a618-566ae2713102/jobs/7253683 Reviewed By: pritamdamania87 Differential Revision: D23455047 fbshipit-source-id: b8dc4a30b95570d68a482c19131674fff2a3bc7c
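As landed, the feature lets callers retrieve the result tensors of an async collective from its Work handle. A minimal sketch, assuming a PyTorch build with gloo support; the single-process group, file-store rendezvous, and tensor shapes here are purely illustrative:

```python
import tempfile

import torch
import torch.distributed as dist

# Single-process gloo group via a file store (illustrative rendezvous only).
rendezvous_file = tempfile.NamedTemporaryFile(delete=False)
dist.init_process_group(
    "gloo",
    init_method=f"file://{rendezvous_file.name}",
    rank=0,
    world_size=1,
)

t = torch.ones(3) * 2
work = dist.all_reduce(t, async_op=True)  # async op returns a Work handle
work.wait()                               # block until the collective finishes
res = work.result()                       # the feature this PR adds: result tensors
dist.destroy_process_group()
```

With world_size 1 the allreduce is the identity, so `res` holds a tensor equal to `t`; in a real multi-rank group it would hold the reduced values.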
Stack from ghstack:
Original commit changeset: 27fbeb161706
Differential Revision: D23455047