
Conversation

Contributor

@sinannasir commented Aug 11, 2020

Stack from ghstack:

We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, performance was almost 50% lower in terms of QPS than when a simple `allreduce` hook is used with no `then` callback.

The main problem was that, because we call `work.wait()` before invoking the `then` callback, we were synchronizing `work`'s stream with the default PyTorch stream inside `runHook` and stalling the backward computation.

In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run the callback, and before invoking the callback inline it synchronizes `WorkNCCL`'s stream with the callback's stream rather than with the default stream.
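Roughly, the stream handoff looks like the following (a minimal sketch using ATen's CUDA stream/event APIs; `runThenCallback`, `ncclEndEvent`, and `runCallback` are illustrative stand-ins, not the actual `FutureNCCL` members):

```cpp
#include <ATen/cuda/CUDAContext.h> // at::cuda::getStreamFromPool
#include <ATen/cuda/CUDAEvent.h>   // at::cuda::CUDAEvent
#include <c10/cuda/CUDAGuard.h>    // c10::cuda::CUDAStreamGuard
#include <functional>

// Run `runCallback` after the NCCL work guarded by `ncclEndEvent`, without
// blocking the default stream (single-process single-device assumed).
at::cuda::CUDAEvent runThenCallback(
    at::cuda::CUDAEvent& ncclEndEvent,
    const std::function<void()>& runCallback) {
  // Grab a side stream from the device's pool instead of the default stream.
  at::cuda::CUDAStream callbackStream = at::cuda::getStreamFromPool();

  // Make only the side stream wait for the allreduce; the default stream
  // (and hence the rest of the backward computation) is left untouched.
  ncclEndEvent.block(callbackStream);

  at::cuda::CUDAEvent callbackEndEvent;
  {
    // Within this scope the callback's stream is the current stream, so any
    // CUDA work launched by the callback is ordered after the allreduce.
    c10::cuda::CUDAStreamGuard guard(callbackStream);
    runCallback();
    callbackEndEvent.record(callbackStream);
  }
  // A later wait() can block its own stream on this event instead of
  // synchronizing the default stream, so the handoff never stalls backward.
  return callbackEndEvent;
}
```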

Differential Revision: D23055807

@facebook-github-bot added the oncall: jit label Aug 11, 2020
… efficiency."


We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, the performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback.

@pritamdamania87 realized that we should run the callback in `then` on a different stream and synchronize the NCCL stream with that stream instead of the default device stream. `then` now returns a new `FutureNCCL` that holds the return value of the callback and new CUDA events that record the stream guarding the callback. This new `FutureNCCL` synchronizes with the callback's stream when its `wait()` is called.

**Ongoing discussions:**
* Calling `fut.wait()` (instead of `fut.value()`) inside the callback of the allreduce-then-divide hook is okay.
* Calling `fut.wait()` when a new `FutureNCCL` object is defined inside the `then` callback is essential for correct results. It could be fine in terms of performance, since within the scope of the `then` callback the current stream is the stream that guards the callback, not PyTorch's default device stream.

Differential Revision: [D23055807](https://our.internmc.facebook.com/intern/diff/D23055807/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 11, 2020
Pull Request resolved: #42869

ghstack-source-id: 109675502

Differential Revision: [D23055807](https://our.internmc.facebook.com/intern/diff/D23055807/)
Contributor

@rohan-varma left a comment

Nice! This overall looks great, have a few comments/questions inline.

// wrapper to synchronize streams appropriately and it mostly enables
// the async programming model of CUDA while trying to adhere to the
// Future interface.
// Future interface. FutureNCCL does not support NCCL_BLOCKING_WAIT flag
Contributor

Can you clarify this a bit? If I do run with NCCL_BLOCKING_WAIT, does this mean that if I call .wait() on a future returned by get_future() it still won't block? Would we want to add this support later on (seems like we would), if so can we file a GH issue? Same goes for barrier() - what wouldn't work if we did barrier() with FutureNCCL?

Contributor Author

I think in the case of NCCL_BLOCKING_WAIT=0 it will just wait until the streams are synchronized, and it will return before the whole NCCL kernel has completed on the GPU.
My understanding is that in the case of barrier() and `NCCL_BLOCKING_WAIT`, the `then` callback will become even more complicated. Any thoughts @pritamdamania87, do we want to support those later?

@rohan-varma
Contributor

Can we describe the perf issue with the previous solution and why this commit fixes it (basically no longer blocking default stream on allreduce in then(), which eliminated overlap)? Would be useful context for those who come across this in the future.

@dr-ci

dr-ci bot commented Aug 11, 2020

💊 CI failures summary and remediations

As of commit 22f629b (more details on the Dr. CI page):


  • 1/2 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)
  • 1/2 broken upstream at merge base 5d608d4 since Aug 17

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


ci.pytorch.org: 1 failed



sinannasir added a commit that referenced this pull request Aug 13, 2020
Pull Request resolved: #42869

ghstack-source-id: 109805861

Differential Revision: [D23055807](https://our.internmc.facebook.com/intern/diff/D23055807/)
Comment on lines 781 to 782
std::shared_ptr<at::cuda::CUDAEvent> cudaEvent = {cudaEvents_,
&(*cudaEvents_)[0]};
Contributor

Why are we doing this and is this even correct? Shouldn't we just pass cudaEvents_ to FutureNCCL?

Contributor Author

I was trying to address @rohan-varma's comment #42869 (comment).

We only have one event, so using a vector probably isn't needed. I was just trying to create a pointer that shares ownership with cudaEvents_ and points to its first element. Since at::cuda::CUDAEvent's copy constructor is deleted, I need to be careful here, but that's probably not an elegant solution and the linter gives a warning.

If using a vector is okay, I can just revert this and keep using a vector inside FutureNCCL.
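For reference, this is `std::shared_ptr`'s aliasing constructor; a minimal sketch (using `int` as a stand-in for `at::cuda::CUDAEvent`, which is move-only):

```cpp
#include <iostream>
#include <memory>
#include <vector>

int main() {
  // Stand-in for cudaEvents_: shared ownership of a vector with one element.
  auto events = std::make_shared<std::vector<int>>(std::vector<int>{42});

  // Aliasing constructor: shares ownership of the vector, but get() points
  // at its first element. Nothing is copied and no new control block is made.
  std::shared_ptr<int> firstEvent(events, &(*events)[0]);

  std::cout << *firstEvent << "\n";        // 42
  std::cout << events.use_count() << "\n"; // 2: both share one control block
  return 0;
}
```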

Contributor

I think let's just pass a vector since I'm not sure if it's possible to create a shared_ptr pointing to one element of a shared_ptr vector.

Comment on lines 754 to 755
Note that ``fut.done()`` returns if work's NCCL streams were synchronized with PyTorch's
default device streams.
Contributor

This doesn't seem to be right? In completed() we actually check if the work has completed on the GPU using cudaEventQuery.
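For context, the difference between stream-level synchronization and actual GPU completion can be sketched with ATen's CUDA event API (an illustrative snippet, not the FutureNCCL code itself; `workStream` is a hypothetical stream that ran the NCCL kernel):

```cpp
#include <ATen/cuda/CUDAContext.h> // at::cuda::getCurrentCUDAStream
#include <ATen/cuda/CUDAEvent.h>   // at::cuda::CUDAEvent

void illustrate(at::cuda::CUDAStream workStream) {
  at::cuda::CUDAEvent event;
  event.record(workStream); // mark the point after the enqueued NCCL work

  // Stream-level sync: enqueue a dependency so the current stream waits for
  // the event. The CPU returns immediately; the kernel may still be running.
  event.block(at::cuda::getCurrentCUDAStream());

  // GPU-completion check (cudaEventQuery under the hood): true only once the
  // work recorded before the event has actually finished on the device.
  bool finished = event.query();
  (void)finished;
}
```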


@sinannasir
Contributor Author

Can we describe the perf issue with the previous solution and why this commit fixes it (basically no longer blocking default stream on allreduce in then(), which eliminated overlap)? Would be useful context for those who come across this in the future.

I agree. I added a short additional description to the documentation. I'll update the summary accordingly and give more details.

sinannasir added a commit that referenced this pull request Aug 15, 2020
Pull Request resolved: #42869

ghstack-source-id: 109993778

Differential Revision: [D23055807](https://our.internmc.facebook.com/intern/diff/D23055807/)
sinannasir added a commit that referenced this pull request Aug 17, 2020
Pull Request resolved: #42869


ghstack-source-id: 110091730

Differential Revision: [D23055807](https://our.internmc.facebook.com/intern/diff/D23055807/)
sinannasir added a commit that referenced this pull request Aug 18, 2020
Pull Request resolved: #42869


ghstack-source-id: 110175728

Differential Revision: [D23055807](https://our.internmc.facebook.com/intern/diff/D23055807/)
sinannasir added a commit that referenced this pull request Aug 19, 2020
Pull Request resolved: #42869


ghstack-source-id: 110208431

Differential Revision: [D23055807](https://our.internmc.facebook.com/intern/diff/D23055807/)
@facebook-github-bot
Contributor

This pull request has been merged in 6e1127e.

@facebook-github-bot facebook-github-bot deleted the gh/sinannasir/10/head branch August 23, 2020 14:17