Skip to content

Conversation

@mrzzd
Copy link
Contributor

@mrzzd mrzzd commented Sep 1, 2020

Stack from ghstack:

Original commit changeset: 27fbeb161706

Differential Revision: D23455047

…rs for allreduce for Gloo, NCCL backends"

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
@dr-ci
Copy link

dr-ci bot commented Sep 1, 2020

💊 CI failures summary and remediations

As of commit faac5ee (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-CircleCI failure(s)

3 jobs timed out:

  • binary_linux_libtorch_3_7m_cu101_gcc5_4_cxx11-abi_nightly_shared-with-deps_build
  • binary_linux_libtorch_3_7m_cu101_gcc5_4_cxx11-abi_nightly_static-with-deps_build
  • binary_linux_libtorch_3_7m_cu101_gcc5_4_cxx11-abi_nightly_shared-without-deps_build

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 85 times.

…esult tensors for allreduce for Gloo, NCCL backends""

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
mrzzd pushed a commit that referenced this pull request Sep 1, 2020
…rs for allreduce for Gloo, NCCL backends"

Pull Request resolved: #43970

Original commit changeset: 27fbeb161706
ghstack-source-id: 111190323

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)
@pritamdamania87
Copy link
Contributor

We can close out this PR since the original diff has been reverted?

@mrzzd mrzzd added the ci-all/ label Sep 3, 2020
…esult tensors for allreduce for Gloo, NCCL backends""

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
mrzzd pushed a commit that referenced this pull request Sep 3, 2020
…rs for allreduce for Gloo, NCCL backends"

Pull Request resolved: #43970

Original commit changeset: 27fbeb161706
ghstack-source-id: 111397109

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)
Copy link
Contributor

@pritamdamania87 pritamdamania87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a resubmission of #43386, can we update the PR title and summary to reflect this appropriately?


result = work.result()
expected_value = 2 + (10 * (len(group) - 1))
self.assertEqual(result, [_build_tensor(src + 1, expected_value)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Since allreduce is in place we can just check if result is equal to tensor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Since allreduce is in place we can just check if result is equal to tensor?

Being equal to tensor does not mean result is ready, it could be both of them have non-finalized values. I see that's being tested in other unitttests, but still, better to not depend on them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are calling work.wait() before the check though right? That should synchronize the appropriate streams to ensure this is safe to do.

std::vector<at::Tensor> outputs_;
};

class AsyncAllreduceCoalescedWork : public AsyncAllreduceWork {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from allreduce, there are other ops like barrier, scatter, gather, reduce etc. can we add this functionality to those as well? Can be in a separate PR on top of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do it in separate PR

Comment on lines 249 to 250
bool result = exception() || finishedGPUExecutionInternal();
return result;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have this in a single line as before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a temporary change to add some log lines. Reverted.

}

std::vector<at::Tensor> ProcessGroupNCCL::WorkNCCL::result() {
TORCH_CHECK(isCompleted());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessarily true for NCCL. isCompleted() returns true only when the work has completed successfully on the GPU. But we can still retrieve the result for the Work and perform async processing. I think we shouldn't have this check for NCCL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, removed.

…esult tensors for allreduce for Gloo, NCCL backends""

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
…esult tensors for allreduce for Gloo, NCCL backends""

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
…esult tensors for allreduce for Gloo, NCCL backends""

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
mrzzd pushed a commit that referenced this pull request Sep 4, 2020
…rs for allreduce for Gloo, NCCL backends"

Pull Request resolved: #43970

Original commit changeset: 27fbeb161706
ghstack-source-id: 111453596

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)
…esult tensors for allreduce for Gloo, NCCL backends""

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
@mrzzd
Copy link
Contributor Author

mrzzd commented Sep 4, 2020

…esult tensors for allreduce for Gloo, NCCL backends""

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
mrzzd pushed a commit that referenced this pull request Sep 4, 2020
…CCL backends

Pull Request resolved: #43970

It is resubmition of #43386

Original commit changeset: 27fbeb161706
ghstack-source-id: 111494058

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)
@mrzzd mrzzd changed the title Back out "Revert D23216393: Support work.result() to get result tensors for allreduce for Gloo, NCCL backends" Support work.result() to get result tensors for allreduce for Gloo, NCCL backends (resubmission of #43386) Sep 4, 2020
Copy link
Contributor

@pritamdamania87 pritamdamania87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Can we also update the PR summary appropriate? It currently still mentions "#43970 Back out "Revert D23216393: Support work.result() to get result tensors for allreduce for Gloo, NCCL backends"". Also, please check all the builds are clean before landing.

…for Gloo, NCCL backends (resubmission of #43386)"

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
mrzzd pushed a commit that referenced this pull request Sep 7, 2020
…CCL backends

Pull Request resolved: #43970

It is resubmition of #43386

Original commit changeset: 27fbeb161706
ghstack-source-id: 111541355

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)
…for Gloo, NCCL backends (resubmission of #43386)"

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
…for Gloo, NCCL backends (resubmission of #43386)"

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
mrzzd pushed a commit that referenced this pull request Sep 10, 2020
…CCL backends

Pull Request resolved: #43970

It is resubmition of #43386

Original commit changeset: 27fbeb161706
ghstack-source-id: 111742817

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)
…for Gloo, NCCL backends (resubmission of #43386)"

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
…for Gloo, NCCL backends (resubmission of #43386)"

Original commit changeset: 27fbeb161706

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)

[ghstack-poisoned]
mrzzd pushed a commit that referenced this pull request Sep 10, 2020
…CCL backends

Pull Request resolved: #43970

It is resubmition of #43386

Original commit changeset: 27fbeb161706
ghstack-source-id: 111775070

Differential Revision: [D23455047](https://our.internmc.facebook.com/intern/diff/D23455047/)
@mrzzd
Copy link
Contributor Author

mrzzd commented Sep 10, 2020

We can close out this PR since the original diff has been reverted?

I just rebased with latest changes in trunk. I am going to submit this as soon as testing finishes.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 2e744b1.

bitfort pushed a commit that referenced this pull request Sep 11, 2020
…CCL backends (#43970)

Summary:
Pull Request resolved: #43970

It is resubmition of #43386

Original commit changeset: 27fbeb161706
ghstack-source-id: 111775070

Test Plan:
Added checks to existing unit test and ran it on gpu devserver.
Verified the test that was failing in original diff also passes: https://app.circleci.com/pipelines/github/pytorch/pytorch/210229/workflows/86bde47b-f2da-48e3-a618-566ae2713102/jobs/7253683

Reviewed By: pritamdamania87

Differential Revision: D23455047

fbshipit-source-id: b8dc4a30b95570d68a482c19131674fff2a3bc7c
@facebook-github-bot facebook-github-bot deleted the gh/mrzzd/2/head branch September 14, 2020 14:17
xuzhao9 pushed a commit that referenced this pull request Sep 18, 2020
…CCL backends (#43970)

Summary:
Pull Request resolved: #43970

It is resubmition of #43386

Original commit changeset: 27fbeb161706
ghstack-source-id: 111775070

Test Plan:
Added checks to existing unit test and ran it on gpu devserver.
Verified the test that was failing in original diff also passes: https://app.circleci.com/pipelines/github/pytorch/pytorch/210229/workflows/86bde47b-f2da-48e3-a618-566ae2713102/jobs/7253683

Reviewed By: pritamdamania87

Differential Revision: D23455047

fbshipit-source-id: b8dc4a30b95570d68a482c19131674fff2a3bc7c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants