Process group base class and Gloo implementation #7628
Conversation
teng-li
left a comment
@pietern Nice work!
I haven't looked through everything yet, and I understand it's a WIP. I will use the base class to get other backends started while you continue working on this. Below are some comments from reading through the code (not everything yet).
I'll review in more detail once the WIP is finished.
This is copied from THD's DataChannel. I figured that instead of having the Python side call it a process group and the C++ side call it a data channel, we can use the same name as the Python side. This does not yet include all the collective ops that will be supported; it serves only as a starting point.
This is a starting point and only implements allreduce for CPU tensors. It includes most of the base functionality, such as algorithm caching (a similar approach to the one taken in the THD GlooCache) and multi-threaded execution (new). The expectation is that function calls on the process group class are globally serialized. They execute collective functions, so members of the collective must call the same functions at the same time, or a deadlock may happen. TODO(pietern): Describe caching behavior
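As a rough caller-side sketch of that expectation (the signatures here are assumptions loosely based on this PR, not a definitive API): every rank issues the same collectives in the same order and waits on the returned work objects.

```cpp
// Hypothetical sketch: every rank must issue the same collectives in the
// same order; diverging call sequences can deadlock the group.
#include <vector>
#include <ATen/ATen.h>
#include <c10d/ProcessGroupGloo.hpp>  // assumed include path

void trainStep(::c10d::ProcessGroupGloo& pg,
               std::vector<at::Tensor>& grads,
               std::vector<at::Tensor>& stats) {
  // First collective on every rank.
  pg.allreduce(grads)->wait();
  // Second collective on every rank; swapping the order on only one rank
  // would leave the group waiting on mismatched collectives.
  pg.allreduce(stats)->wait();
}
```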
Rebased (89df263 to c511c54).
apaszke
left a comment
Mostly LGTM.
The locking/CV story in ProcessGroupGloo is quite complicated (especially around the cache). It seems like we could reduce contention on the global lock, and simplify things a bit, if we kept a std::mutex as part of each cache entry (the algorithms would effectively never leave the cache, and you'd be blocked for exactly as long as you need to be).
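A minimal sketch of that suggestion, using hypothetical names rather than the actual c10d types: each cache entry carries its own mutex, so a caller only blocks while another caller is using that specific algorithm.

```cpp
// Hypothetical per-entry locking sketch; these are not the actual c10d types.
#include <memory>
#include <mutex>
#include <unordered_map>

struct EntrySketch {
  std::mutex m;  // protects only this entry's algorithm and scratch buffers
  // ... cached Gloo algorithm and tensors would live here ...
};

struct CacheSketch {
  std::mutex mapMutex;  // protects the map itself; held only briefly
  std::unordered_map<int, std::unique_ptr<EntrySketch>> entries;

  EntrySketch* lookup(int key) {
    std::lock_guard<std::mutex> guard(mapMutex);
    std::unique_ptr<EntrySketch>& slot = entries[key];
    if (!slot) {
      slot.reset(new EntrySketch());  // entries never leave the cache
    }
    return slot.get();
  }
};

void runCollective(CacheSketch& cache, int key) {
  EntrySketch* entry = cache.lookup(key);
  // A caller blocks only while another caller is using this same entry,
  // not for the duration of unrelated collectives.
  std::lock_guard<std::mutex> guard(entry->m);
  // ... run the cached algorithm ...
}
```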
torch/lib/c10d/CMakeLists.txt (outdated)
target_include_directories(process_group PUBLIC ${ATEN_INCLUDE_DIR})
target_link_libraries(process_group PUBLIC ${ATEN_LIBRARIES})

add_library(process_group_gloo ProcessGroupGloo.cpp)
torch/lib/c10d/ProcessGroup.hpp (outdated)
// it adds an asynchronous wait for the internal stream
// (cudaEventSynchronize). This way we retain the ability to write
// sequential code that executes asynchronously, without requiring the
// caller to perform explicit synchronization.
virtual bool wait() = 0;

// Returns exception if wait() returned false.
virtual const std::exception& exception() const = 0;
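A hedged sketch of how a caller might consume that interface, assuming allreduce() returns a shared pointer to such a Work object (the helper name and surrounding setup are hypothetical):

```cpp
// Hypothetical error-handling pattern for the Work interface quoted above.
#include <iostream>
#include <vector>
#include <ATen/ATen.h>
#include <c10d/ProcessGroupGloo.hpp>  // assumed include path

bool allreduceOrReport(::c10d::ProcessGroupGloo& pg,
                       std::vector<at::Tensor>& tensors) {
  auto work = pg.allreduce(tensors);
  if (!work->wait()) {
    // exception() is only meaningful when wait() returned false.
    std::cerr << "allreduce failed: " << work->exception().what() << std::endl;
    return false;
  }
  return true;
}
```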
// Must not be copyable
AlgorithmEntry& operator=(const AlgorithmEntry&) = delete;
AlgorithmEntry(const AlgorithmEntry&) = delete;
torch/lib/c10d/ProcessGroupGloo.hpp (outdated)
void initialize();

void initialize(Options& options);
torch/lib/c10d/ProcessGroupGloo.cpp (outdated)
auto& srcSizes = key.srcSizes;
entry->src.resize(srcSizes.size());
for (int i = 0; i < srcSizes.size(); i++) {
  entry->src[i] = at::zeros(*key.type, at::IntList(srcSizes[i]));
torch/lib/c10d/ProcessGroupGloo.cpp (outdated)
  entry->src[i] = at::zeros(*key.type, at::IntList(srcSizes[i]));
}

return std::move(entry);
torch/lib/c10d/ProcessGroupGloo.cpp (outdated)
// Grab entry from the cache and return it.
auto entry = std::move(it->second);
cache_.erase(key);
return std::move(entry);
torch/lib/c10d/ProcessGroupGloo.cpp (outdated)
std::unique_lock<std::mutex> lock(m_);
queue_.push_back(std::make_tuple(std::move(entry), work));
queueProduceCV_.notify_one();
return std::move(work);
torch/lib/c10d/ProcessGroupGloo.cpp (outdated)
}

// Define how to run the algorithm and copy back results
entry->run = [tensors](EntryType& entry) mutable {
Force-pushed from d3bacfa to ad8a2c1.
Thanks for the comprehensive review @apaszke. I changed the cache behavior to no longer std::move a unique_ptr around, but to keep it in the cache and use a raw pointer to refer to it. This way the workers only need to acquire a mutex on the queue. When they're done, they mark the entry as done and notify the CV associated with the entry. If in the meantime a new caller comes in and needs to use it, it will block on this done flag being set. This is about as minimal as it gets.
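A minimal sketch of that handshake, with hypothetical names (only the busy flag, a per-entry mutex, and the CV are shown; the real code also coordinates through the work queue and its mutex, and the entry holds the cached algorithm and tensors):

```cpp
// Hypothetical sketch of the per-entry busy flag and condition variable.
#include <condition_variable>
#include <mutex>

struct EntryStateSketch {
  std::mutex m;
  std::condition_variable cv;
  bool busy = false;  // set while a collective is queued or running
};

// Caller side: claim the entry, blocking until a previous call has released it.
void claim(EntryStateSketch& entry) {
  std::unique_lock<std::mutex> lock(entry.m);
  entry.cv.wait(lock, [&entry] { return !entry.busy; });
  entry.busy = true;
}

// Worker side: after running the algorithm, mark the entry free and wake
// any caller waiting to reuse it.
void release(EntryStateSketch& entry) {
  {
    std::lock_guard<std::mutex> lock(entry.m);
    entry.busy = false;
  }
  entry.cv.notify_one();
}
```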
@pytorchbot retest this please
)

add_library(c10d ${C10D_SRCS})
target_compile_options(c10d PUBLIC "-std=c++11")
const AllreduceOptions& opts = AllreduceOptions()) = 0;

protected:
  const int rank_;
torch/lib/c10d/ProcessGroupGloo.cpp (outdated)
: ProcessGroup(rank, size), store_(new GlooStore(store)), stop_(false) {
  auto& devices = options.devices;
  if (devices.empty()) {
    devices.push_back(::gloo::transport::tcp::CreateDevice("localhost"));
- return construct(key);
+ if (it == cache_.end()) {
+   cache_[key] = construct(key);
+   it = cache_.find(key);
- cache_.erase(key);
- return entry;
+ // Mark entry in use
+ entry->busy = true;
  // Define how to run the algorithm and copy back results
- entry->run = [tensors](EntryType& entry) mutable {
+ entry->run = [=]() mutable {
Whoops, I commented on the older commits, and GitHub is hiding all my comments for some reason. I think some of them still apply, so please take a look. They're not really blocking in any way, but it would be good to document what each mutex protects, and our entry mutex strategy could have been simplified.
Thanks for the comments @apaszke. Adding a comment for the mutex and then merging.
Ehh, I had already removed it. It's good as is.
…e2_core_hip
* 'caffe2_core_hip' of github.com:petrex/pytorch: (24 commits)
  Allow empty storage for the 'Edge' class. (pytorch#7595)
  Process group base class and Gloo implementation (pytorch#7628)
  _LRSchedulers getstate include optimizer info (pytorch#7757)
  [PyTorch] [gradcheck] change backward() to grad() (pytorch#7710)
  Update test_nn.py (pytorch#7787)
  Define general default scheduler for TBB and fix ppc64le bug (pytorch#7761)
  Add support for accepting Tensor as input in clip_grad_* functions. (pytorch#7769)
  [Easy] Remove unused code (pytorch#7782)
  Update tbb (pytorch#7734)
  Add @generated annotation (pytorch#7780)
  fix legacy comment after variable tensor merge (pytorch#7771)
  Revert pytorch#7750 and pytorch#7762 to fix Windows CI on master (pytorch#7772)
  Temporarily disable build env check (pytorch#7768)
  Add missing brace (pytorch#7762)
  [C++ API] Add backward() to Tensor and Variable (pytorch#7750)
  [auto] Update onnx to d43b550 - Fix .gitignore and add missing files (onnx/onnx#1005) onnx/onnx@d43b550
  [auto] Update onnx to ea1aa13 - add tests for reduce ops (onnx/onnx#675) onnx/onnx@ea1aa13
  include cudnn_h (pytorch#7749)
  [C++ API] Using new registration mechanism (pytorch#7663)
  [auto] Update onnx to 5dd68e6 - Add a util function: polish_model (onnx/onnx#1000) onnx/onnx@5dd68e6
  ...
This is a starting point and only implements allreduce for CPU tensors. It includes most of the base functionality, such as algorithm caching (a similar approach to the one taken in the THD GlooCache) and multi-threaded execution (new).
The expectation is that function calls on the process group class are globally serialized. They execute collective functions, so members of the collective must call the same functions in the same order, or a deadlock may happen.
The algorithm cache works as follows: the ProcessGroupGloo class has a cache map from algorithm keys to algorithm entries. The algorithm key is a struct with fields that make up the signature of a collective function. It includes the dimensionality of the input/output tensors, tensor device assignment, source/destination rank, etc. For collective calls with the same key, the process group will lazily initialize and then cache a Gloo algorithm instance. For now we only keep a single algorithm instance per key, but this may be revisited in the future if we observe contention on a single key and can exploit additional parallelism.
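As an illustration of that structure, here is a hedged sketch of the cache; the field and type names are partly modeled on this PR's diff (for example srcSizes) and partly hypothetical, and the real key and entry types contain more fields.

```cpp
// Hypothetical sketch of an algorithm cache keyed on a collective's signature.
#include <cstdint>
#include <map>
#include <memory>
#include <tuple>
#include <vector>

struct AlgorithmKeySketch {
  // Fields that make up the "signature" of a collective call; the real key
  // also captures scalar type, device assignment, reduce op, and so on.
  std::vector<std::vector<int64_t>> srcSizes;  // input tensor shapes
  int srcRank = -1;                            // source/destination rank, if any

  bool operator<(const AlgorithmKeySketch& other) const {
    return std::tie(srcSizes, srcRank) < std::tie(other.srcSizes, other.srcRank);
  }
};

struct AlgorithmEntrySketch {
  // In the real implementation this holds the lazily constructed Gloo
  // algorithm instance plus cached source/destination tensors.
};

struct AlgorithmCacheSketch {
  std::map<AlgorithmKeySketch, std::unique_ptr<AlgorithmEntrySketch>> cache_;

  AlgorithmEntrySketch* get(const AlgorithmKeySketch& key) {
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      // Lazily construct and cache a single entry per key; the entry is
      // reused for every later collective call with the same signature.
      it = cache_.emplace(key, construct(key)).first;
    }
    return it->second.get();
  }

  std::unique_ptr<AlgorithmEntrySketch> construct(const AlgorithmKeySketch&) {
    return std::unique_ptr<AlgorithmEntrySketch>(new AlgorithmEntrySketch());
  }
};
```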