[NCCL] Changed FutureNCCL's then callback logic for better efficiency. #42869
Conversation
We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, the performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback. @pritam realized that we should run the callback in `then` on a different stream and synchronize the NCCL stream with that stream, rather than with the default device stream. Now a new `FutureNCCL` is returned from `then`; it holds the return value of the callback and new CUDA events that recorded the stream that guards this callback. This new FutureNCCL synchronizes on the new stream used for the callback when its wait() is called.

**Ongoing discussions:**

* Calling `fut.wait()` (instead of `fut.value()`) inside the callback of the allreduce-then-divide hook is okay.
* Calling `fut.wait()` when a new FutureNCCL object is defined inside the `then` callback is essential to get correct results. It could be fine in terms of performance, since within the scope of the `then` callback the default stream becomes the stream that guards the callback, not PyTorch's default device stream.

Differential Revision: [D23055807](https://our.internmc.facebook.com/intern/diff/D23055807/)
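The stream choreography described above can be summarized with a short sketch. This is a hedged illustration, not the FutureNCCL code in this PR: it only shows the idea using ATen/c10 CUDA primitives (`getStreamFromPool`, `CUDAEvent::block`/`record`, `CUDAStreamGuard`), and the helper name `runCallbackOnPoolStream` and its signature are hypothetical.

```cpp
// Hedged sketch of "run the then() callback on a side stream".
// Not the FutureNCCL implementation; it only illustrates the
// event/stream choreography with ATen/c10 CUDA primitives.
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>
#include <functional>

// `ncclDoneEvent` is assumed to have been recorded on the NCCL stream
// right after the collective was enqueued.
at::Tensor runCallbackOnPoolStream(
    at::cuda::CUDAEvent& ncclDoneEvent,
    const std::function<at::Tensor()>& callback,
    at::cuda::CUDAEvent& callbackDoneEvent) {
  // Grab a fresh stream from the device pool instead of the default stream,
  // so work on the default stream (e.g. backward) is not stalled.
  c10::cuda::CUDAStream sideStream = c10::cuda::getStreamFromPool();
  // Make the side stream wait for the NCCL work (stream-level sync only;
  // the host thread does not block here).
  ncclDoneEvent.block(sideStream);
  at::Tensor result;
  {
    // Every kernel launched by the callback now goes to the side stream.
    c10::cuda::CUDAStreamGuard guard(sideStream);
    result = callback();
  }
  // Record an event on the side stream; the FutureNCCL returned by then()
  // would carry this event and block consumers on it in its wait().
  callbackDoneEvent.record(sideStream);
  return result;
}
```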
rohan-varma
left a comment
Nice! This overall looks great, have a few comments/questions inline.
// wrapper to synchronize streams appropriately and it mostly enables
// the async programming model of CUDA while trying to adhere to the
// Future interface. FutureNCCL does not support NCCL_BLOCKING_WAIT flag
Can you clarify this a bit? If I run with NCCL_BLOCKING_WAIT, does this mean that calling .wait() on a future returned by get_future() still won't block? Would we want to add this support later on (it seems like we would)? If so, can we file a GH issue? The same goes for barrier(): what wouldn't work if we did barrier() with FutureNCCL?
I think in the case of NCCL_BLOCKING_WAIT=0 it will just wait until the streams are synchronized, and it will return before the whole NCCL kernel has completed on the GPU.
My understanding is that in the case of barrier() and `NCCL_BLOCKING_WAIT`, the `then` callback will become even more complicated. Any thoughts @pritamdamania87, do we want to support those later?
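To make the distinction concrete, here is a small hedged C++ sketch contrasting stream-level synchronization (which returns to the host before the kernel finishes) with host-blocking synchronization (closer to what NCCL_BLOCKING_WAIT semantics would require). It uses only `at::cuda::CUDAEvent::block` and `synchronize`; it is an illustration, not the code under review.

```cpp
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAStream.h>

// Assume `ncclDoneEvent` was recorded on the NCCL stream after the collective.

void waitNonBlocking(at::cuda::CUDAEvent& ncclDoneEvent) {
  // Stream-level sync: future GPU work on the current stream waits for the
  // event, but this call returns to the host immediately, possibly before
  // the NCCL kernel has finished on the GPU.
  ncclDoneEvent.block(c10::cuda::getCurrentCUDAStream());
}

void waitBlocking(at::cuda::CUDAEvent& ncclDoneEvent) {
  // Host-level sync: the calling CPU thread blocks until the work recorded
  // by the event has actually completed on the GPU.
  ncclDoneEvent.synchronize();
}
```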
Can we describe the perf issue with the previous solution and why this commit fixes it (basically we no longer block the default stream on the allreduce in then(), which had eliminated overlap)? It would be useful context for those who come across this in the future.
💊 CI failures summary and remediations
As of commit 22f629b (more details on the Dr. CI page):
🚧 1 ongoing upstream failure (ci.pytorch.org), probably caused by upstream breakages that are not fixed yet.
This comment was automatically generated by Dr. CI.
torch/lib/c10d/ProcessGroupNCCL.cpp
Outdated
std::shared_ptr<at::cuda::CUDAEvent> cudaEvent = {cudaEvents_,
                                                  &(*cudaEvents_)[0]};
Why are we doing this and is this even correct? Shouldn't we just pass cudaEvents_ to FutureNCCL?
I was trying to address @rohan-varma's comment #42869 (comment).
We just have one event, so using a vector probably isn't needed. I was just trying to create a pointer that shares ownership with cudaEvents_ and points to its first element. Since at::cuda::CUDAEvent's copy constructor is deleted, I need to be careful here; that's probably not an elegant solution, and the linter gives a warning.
If using a vector is okay, I can just revert this and keep using a vector inside FutureNCCL.
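For reference, the pattern described above is the `std::shared_ptr` aliasing constructor. A minimal standalone sketch (plain C++, not PR code; the `Event` type is a hypothetical stand-in for `at::cuda::CUDAEvent`):

```cpp
#include <memory>
#include <vector>

struct Event {
  // Stand-in for at::cuda::CUDAEvent: movable but not copyable.
  Event() = default;
  Event(const Event&) = delete;
  Event(Event&&) = default;
};

int main() {
  auto events = std::make_shared<std::vector<Event>>(1);
  // Aliasing constructor: shares ownership with `events` but points at its
  // first element, so no Event is copied and the vector stays alive as long
  // as `first` does. Note: the pointer dangles if the vector reallocates.
  std::shared_ptr<Event> first(events, &(*events)[0]);
  return first ? 0 : 1;
}
```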
I think let's just pass a vector since I'm not sure if it's possible to create a shared_ptr pointing to one element of a shared_ptr vector.
torch/csrc/distributed/c10d/init.cpp
Outdated
Note that ``fut.done()`` returns if work's NCCL streams were synchronized with PyTorch's
default device streams.
This doesn't seem to be right: in completed() we actually check whether the work has completed on the GPU using cudaEventQuery.
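For context, a hedged sketch of that kind of GPU-side completion check, built on `at::cuda::CUDAEvent::query()` (which wraps `cudaEventQuery`); the helper name is hypothetical and this is not the WorkNCCL source:

```cpp
#include <ATen/cuda/CUDAEvent.h>
#include <vector>

// True only once every recorded event (i.e. the GPU work behind it) has
// actually finished, as reported by cudaEventQuery; returns false while the
// kernels are still running, even if the streams were already synchronized.
bool allEventsCompleted(const std::vector<at::cuda::CUDAEvent>& events) {
  for (const auto& event : events) {
    if (!event.query()) {
      return false;
    }
  }
  return true;
}
```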
I agree. I included an additional small description in the documentation. I'll update the summary accordingly and give more details.
This pull request has been merged in 6e1127e.
Stack from ghstack:
We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, the performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback.

The main problem was that, because we call `work.wait()` before invoking the `then` callback, we were synchronizing `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609) and stalling the backward computation.

In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run the `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream rather than with the default stream.

Differential Revision: [D23055807](https://our.internmc.facebook.com/intern/diff/D23055807/)
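As a complement to the description above, here is a hedged sketch of the consumer side: when `wait()` is called on the FutureNCCL returned by `then`, the caller's current stream is made to wait on the events recorded on the callback's side stream. The free-function form and name are illustrative, not the PR's implementation.

```cpp
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAStream.h>
#include <memory>
#include <vector>

// Blocks only at the stream level: subsequent GPU work submitted to the
// current stream is ordered after the callback's kernels, while the host
// thread returns immediately.
void waitOnCallbackEvents(
    const std::shared_ptr<std::vector<at::cuda::CUDAEvent>>& callbackEvents) {
  auto currentStream = c10::cuda::getCurrentCUDAStream();
  for (auto& event : *callbackEvents) {
    event.block(currentStream);  // stream-level, non-blocking for the host
  }
}
```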