add device to CUDAEvent #9415
Conversation
pietern left a comment
A couple of comments; the perf one is the most important.
torch/lib/c10d/CUDAUtils.cpp
Outdated
```cpp
CUDAEvent::~CUDAEvent() {
  // cudaEventDestroy must run on the same device as the event,
  // otherwise it also creates a context on the default device.
  at::DeviceGuard guard(device_);
```
torch/lib/c10d/CUDAUtils.hpp
Outdated
```cpp
// Must be move assignable.
CUDAEvent& operator=(CUDAEvent&& other) {
  std::swap(event_, other.event_);
  device_ = other.device_;
```
torch/lib/c10d/CUDAUtils.hpp
Outdated
```cpp
// Must be move constructible.
CUDAEvent(CUDAEvent&& other) {
  std::swap(event_, other.event_);
  device_ = other.device_;
```
torch/lib/c10d/CUDAUtils.hpp
Outdated
```cpp
void setDevice(int device) {
  device_ = device;
}
```
torch/lib/c10d/CUDAUtils.cpp
Outdated
```cpp
C10D_CUDA_CHECK(cudaEventCreateWithFlags(&event.event_, flags));
int current_device;
C10D_CUDA_CHECK(cudaGetDevice(&current_device));
event.setDevice(current_device);
```
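Taken together, the diff fragments above outline a small RAII wrapper. The following is a hedged, self-contained reconstruction of what that wrapper plausibly looks like; the include paths, the stand-in for the C10D_CUDA_CHECK macro, and everything not visible in the diff (the null check, swapping the device on move, the create() signature) are assumptions rather than the PR's exact code.

```cpp
// Hedged reconstruction assembled from the diff fragments above; details not
// shown in the diff are assumptions, not the PR's exact code.
#include <cassert>
#include <utility>
#include <cuda_runtime.h>
#include <ATen/DeviceGuard.h>  // for at::DeviceGuard (include path assumed)

// Minimal stand-in for c10d's C10D_CUDA_CHECK macro (assumption).
#define C10D_CUDA_CHECK(expr) \
  do { cudaError_t err = (expr); assert(err == cudaSuccess); } while (0)

class CUDAEvent {
 public:
  CUDAEvent() = default;

  // Movable but not copyable: copying the raw cudaEvent_t handle would make
  // two wrappers destroy the same event.
  CUDAEvent(CUDAEvent&& other) {
    std::swap(event_, other.event_);
    std::swap(device_, other.device_);
  }
  CUDAEvent& operator=(CUDAEvent&& other) {
    std::swap(event_, other.event_);
    std::swap(device_, other.device_);
    return *this;
  }
  CUDAEvent(const CUDAEvent&) = delete;
  CUDAEvent& operator=(const CUDAEvent&) = delete;

  ~CUDAEvent() {
    if (event_ != nullptr) {
      // Switch to the creating device before destroying, so no context is
      // lazily created on whatever device is current (the point of this PR).
      at::DeviceGuard guard(device_);
      C10D_CUDA_CHECK(cudaEventDestroy(event_));
    }
  }

  static CUDAEvent create(unsigned int flags) {
    CUDAEvent event;
    C10D_CUDA_CHECK(cudaEventCreateWithFlags(&event.event_, flags));
    // Remember which device the event was created on for guarded destruction.
    int current_device;
    C10D_CUDA_CHECK(cudaGetDevice(&current_device));
    event.setDevice(current_device);
    return event;
  }

  void setDevice(int device) {
    device_ = device;
  }

 private:
  cudaEvent_t event_ = nullptr;
  int device_ = 0;
};
```

Usage would then look roughly like `auto event = CUDAEvent::create(cudaEventDisableTiming);`, with the destructor guaranteed to run against the event's own device rather than whatever device the destroying thread happens to have current.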
@ailzhang unfortunately yes, this is expected behavior. I have a general question though: why does c10d have its own wrappers for CUDAStream and CUDAEvent? Can't it use ATen's for CUDAStream? And IMO it would make sense to add a CUDAEvent wrapper to ATen too. cc @mruberry, @goldsborough
@ngimel The wrappers weren't there a month (or two) ago. We should switch to the ones in ATen where possible.
facebook-github-bot left a comment
@pietern has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
#8354 (which will be updated shortly) adds CUDAEvent to ATen. @goldsborough is adding a relevant CUDAGuard now, too. It is straightforward to convert the c10d stream and event wrappers to use them (I'll make sure we do that as a follow-up), and we can also use them to extend the stream API to support additional synchronization behavior. Note that c10d wraps a THCStream*, which is a CUDAStreamInternal* (I added a typedef for it when porting streams to ATen). We should probably just make c10d::CUDAStream be an at::CUDAStream instead of defining another wrapper.
Thanks for the update @mruberry. I appreciate you updating c10d when the ATen wrappers land. I assume you'll include the appropriate guarding there as well, which was the reason for this PR?
I don't think there's any harm in putting it in now, but yes, if you like we can put it in for you.
Thanks -- this PR addresses the additional context created on GPU 0 when deallocating from a thread without a device set, so the same guard would be necessary in the ATen CUDAEvent wrapper as well.
@pytorchbot retest this please
@pytorchbot retest this please
* upstream/master: (24 commits)
  - Implement tensor weak references (pytorch#9363)
  - Nuke TestCollectEnv (pytorch#9459)
  - Add test case for segmentation fault fix in grad_fn (pytorch#9457)
  - Add peephole optimization for type_as operators. (pytorch#9316)
  - Fix out-of-range error for test_neg (pytorch#9431)
  - add depthwise conv support for mkldnn (pytorch#8782)
  - Refactor `_log_sum_exp` (pytorch#9173)
  - Add ModuleDict and ParameterDict containers (pytorch#8463)
  - Introduce SupervisedPtr, delete THAllocator and THCDeviceAllocator (pytorch#9358)
  - Introducing IsInf (pytorch#9169)
  - add device to CUDAEvent (pytorch#9415)
  - Make localScalar error message more intuitive (pytorch#9443)
  - Only accept continguous tensors in TopK for cuda (pytorch#9441)
  - Add support for .norm() pytorch onnx export and ReduceL1/ReduceL2 caffe2 operators (pytorch#9299)
  - Only view() rhs of index_put if we need to (pytorch#9424)
  - Add BatchBucketizeOp in caffe2 (pytorch#9385)
  - Implementation of Wngrad optimizer caffe2 python wrapper and unit test on least square regression (pytorch#9001)
  - Implementation and operator test for Wngrad optimizer (pytorch#8999)
  - Fix segmentation fault in grad_fn (pytorch#9292)
  - update docs (pytorch#9423)
  - ...
Summary: This PR adds a device_ member to CUDAEvent. This is necessary because if we create a cudaEvent on one device but destroy it from another, an additional context is also created on that device. The device information is therefore needed to guard the cudaEventDestroy call. (cc: ngimel, is this expected behavior? I can provide a simple .cu script to repro this.) c10d tests are probably not in CI yet; please let me know how the tests are run and I'll double check. Thanks pietern and apaszke for help debugging!
Pull Request resolved: pytorch#9415
Reviewed By: apaszke
Differential Revision: D8839688
Pulled By: ailzhang
fbshipit-source-id: b950ba37d57b9e3c5fe71726ec92f6a9601c4d0e
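The "simple .cu script" mentioned in the summary is not attached to the thread; below is a hypothetical sketch of what such a repro could look like. It requires at least two GPUs, and the file name, flag choice, and the nvidia-smi observation step are assumptions, not the author's actual script.

```cpp
// repro.cu -- hypothetical repro sketch, not the script referenced in the PR.
// Create an event on device 1, then destroy it while device 0 is current.
// Without a device guard around cudaEventDestroy, the destroy call lazily
// initializes a CUDA context on device 0 as well (visible as extra per-process
// memory in nvidia-smi). Error checking omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaEvent_t event;

  cudaSetDevice(1);                                          // context on device 1
  cudaEventCreateWithFlags(&event, cudaEventDisableTiming);

  cudaSetDevice(0);                                          // switch current device
  cudaEventDestroy(event);                                   // also touches device 0

  std::printf("Event destroyed; check nvidia-smi for a context on device 0.\n");
  return 0;
}
```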
Summary: After submitting PR #9726, PR #10581 created a different CUDAEvent class. The CUDAEvent proposed in #9726 was similar to the c10d::CUDAEvent class, with additional testing and functionality; in particular, it was movable but not copyable. The CUDAEvent created by #10581 is refcounted and copyable. This PR retains the refcounting of the latter PR while fixing several bugs, adding tests, and extending the functionality to support testing and usage like in PR #8354. In particular, this PR:
- Adds set_device() to CUDAContext
- Adds three CUDAEvent tests to stream_test.cpp
- Fixes three bugs:
  - Refcounting was broken. Destroying any of the RAIIs holding a particular CUDAEvent would destroy the event UNLESS it was the last RAII (the check was backwards).
  - Moving an event would cause a segfault.
  - Events were not destroyed on the device they were created on. See PR #9415 (pietern).
- Adds the happened() and recordOnce() functions
- Changes the record() functions to not be const
- Adds additional assertions to verify correctness

This PR does not:
- Make c10d use the ATen CUDAEvent (this is appropriate for a separate PR)

Whether events should be refcounted is an interesting question. It adds some atomic operations and makes event creation eager. Making events movable but not copyable (like the c10d events) avoids these costs and allows events to be lazily constructed. Lazy construction is preferable when working with containers (like std::array or std::vector), and because the event's device can be set automatically to the first stream it's recorded on. With eager construction, the user is required to understand that events have a device and to acquire the device of the stream the event will be recorded on upfront. This can be seen here: https://github.com/pytorch/pytorch/blob/542aadd9a7609892e207c1e15de08a975b697752/aten/src/ATen/native/cudnn/RNN.cpp#L1130-L1132 and that file is the only one which currently uses the ATen CUDAEvent. Refcounting does allow single-writer, multi-reader scenarios, although these scenarios can also be supported by providing indirect access to the underlying CUDAEvent. I believe all current and planned usage scenarios do not require refcounting, and if desired I can update this PR to remove refcounting and make the ATen event movable but not copyable like the c10d event. I think not refcounting is preferable because it can improve performance, ease usability, and simplify the code (as seen with two of the above bugs). I have decided to separate this from PR #8354 since, while it's required for PR #8354, the changes are clearly of independent interest. PR #8354 now has a dependency on this one, however. I am closing PR #9726 in favor of this PR. apaszke ezyang pietern
Pull Request resolved: #11293
Differential Revision: D9665836
Pulled By: soumith
fbshipit-source-id: a1513fa4f9761e2f304d126e402f6b6950e1c1d2
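To make the lazy-construction argument above concrete, here is a hypothetical sketch; it is not the actual ATen CUDAEvent API, the class name, the stream_device parameter, and the flag choice are assumptions, and move support plus error checking are omitted for brevity.

```cpp
// Hypothetical illustration of lazy construction; not the ATen API.
#include <cuda_runtime.h>

class LazyEvent {
 public:
  LazyEvent() = default;  // no CUDA call yet, so containers of events are cheap

  LazyEvent(const LazyEvent&) = delete;
  LazyEvent& operator=(const LazyEvent&) = delete;

  // The event's device is captured from the first stream it is recorded on,
  // so callers never have to look the device up before constructing the event.
  void record(cudaStream_t stream, int stream_device) {
    if (event_ == nullptr) {
      int prev_device;
      cudaGetDevice(&prev_device);
      cudaSetDevice(stream_device);
      cudaEventCreateWithFlags(&event_, cudaEventDisableTiming);
      device_ = stream_device;
      cudaSetDevice(prev_device);
    }
    cudaEventRecord(event_, stream);
  }

  ~LazyEvent() {
    if (event_ != nullptr) {
      // Destroy on the creating device (see #9415); a full implementation
      // would restore the previously current device afterwards.
      cudaSetDevice(device_);
      cudaEventDestroy(event_);
    }
  }

 private:
  cudaEvent_t event_ = nullptr;
  int device_ = 0;
};
```

With this shape, a `std::vector<LazyEvent>` sized to the number of streams costs nothing until the events are actually recorded, which is the container argument made in the summary; the eager design instead requires the caller to query each stream's device before constructing the corresponding event.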