
Conversation

Contributor

@pietern pietern commented May 16, 2018

This is a starting point and only implements allreduce for CPU tensors. It includes most of the base functionality, like algorithm caching (a similar approach to the one taken in the THD GlooCache) and multi-threaded execution (new).

The expectation is that function calls on the process group class are globally serialized. They execute collective functions, so members of the collective must call the same functions in the same order, or a deadlock may happen.
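
For illustration, a minimal sketch of what this means in practice (the c10d namespace, the step() wrapper, and the Work-returning allreduce() are assumed here, not quoted from the diff):

// Every member of the group must issue the same collectives in the same order.
void step(c10d::ProcessGroup& pg,
          std::vector<at::Tensor>& grads,
          std::vector<at::Tensor>& stats) {
  auto work1 = pg.allreduce(grads);   // call 1 on every rank
  auto work2 = pg.allreduce(stats);   // call 2 on every rank
  work1->wait();
  work2->wait();
  // Reordering or skipping one of these calls on a single rank may deadlock.
}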

The algorithm cache works as follows: the ProcessGroupGloo class has a cache map from algorithm keys to algorithm entries. The algorithm key is a struct with fields that make up the signature of a collective function. It includes the dimensionality of the input/output tensors, tensor device assignment, source/destination rank, etc. For collective calls with the same key, the process group will lazily initialize and then cache a Gloo algorithm instance. For now we only keep a single algorithm instance per key, but this may be revisited in the future, if we observe contention on a single key and can exploit additional parallelism.
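
A rough sketch of that layout (field and type names here are illustrative, not necessarily the ones used in the diff):

// Signature of a collective call; calls with equal keys may share a cached algorithm.
struct AlgorithmKey {
  CollectiveType collectiveType;               // which collective (e.g. allreduce)
  at::Type* type = nullptr;                    // tensor type of the inputs
  std::vector<std::vector<int64_t>> srcSizes;  // sizes of the input tensors
  std::vector<int> srcDevices;                 // device assignment
  int srcRank = -1;                            // source/destination rank, if any
  bool operator==(const AlgorithmKey& other) const;
};

// One lazily initialized entry per key, owned by ProcessGroupGloo.
// AlgorithmKeyHash is a hash functor for the key (assumed here).
std::unordered_map<AlgorithmKey, std::unique_ptr<AlgorithmEntry>, AlgorithmKeyHash> cache_;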

@pietern pietern added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 16, 2018
@pietern pietern changed the title [c10d] Process group base class and gloo implementation [c10d][wip] Process group base class and gloo implementation May 16, 2018
@pietern pietern requested a review from teng-li May 16, 2018 21:42
Contributor

@teng-li teng-li left a comment


@pietern Nice work!

I haven't looked through everything yet, and understand it's a WIP. I will use the base class to get other backends started while you continue working on this. Some comments while I was reading through the code (not everything yet).

Will review later when you finish the WIP


pietern added 7 commits May 18, 2018 11:11
This is copied from THD's DataChannel. I figured that instead of having the
Python side call it a process group and the C++ side call it a data channel,
we can use the Python side's name for both.

This does not yet include all collective ops that will be supported and
serves just as a starting point.
This is a starting point and only implements allreduce for CPU
tensors. It includes most base functionality like algorithm
caching (similar approach as taken in the THD GlooCache) and
multi-threaded execution (new).

The expectation is that function calls on the process group class are
globally serialized. They execute collective functions, so members of
the collective must call the same functions in the same order, or a
deadlock may happen.

TODO(pietern): Describe caching behavior
@pietern pietern force-pushed the c10d-process-group branch from 89df263 to c511c54 Compare May 18, 2018 18:23
Contributor Author

pietern commented May 18, 2018

Rebased

@pietern pietern changed the title [c10d][wip] Process group base class and gloo implementation Process group base class and gloo implementation May 18, 2018
Contributor

@apaszke apaszke left a comment


Mostly LGTM.

The locking/CV story in ProcessGroupGloo is quite complicated (especially around the cache). It seems like we could reduce contention on the global lock, and simplify things a bit, if we kept a std::mutex as part of each cache entry: the algorithms would effectively never leave the cache, and you'd be blocked for exactly as long as you need to be.
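
Something along these lines (a sketch of the suggestion, not code from this PR; lookupOrCreate() is a placeholder for the cache lookup):

struct AlgorithmEntry {
  std::mutex m;                  // protects only this entry
  std::condition_variable cv;    // signalled when the entry becomes free
  bool busy = false;
  std::unique_ptr<::gloo::Algorithm> algorithm;
  // ... cached src tensors, run function, etc.
};

// Callers block on the entry they need instead of a process-group-wide lock.
AlgorithmEntry* entry = lookupOrCreate(key);
{
  std::unique_lock<std::mutex> lock(entry->m);
  entry->cv.wait(lock, [&] { return !entry->busy; });
  entry->busy = true;
}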

target_include_directories(process_group PUBLIC ${ATEN_INCLUDE_DIR})
target_link_libraries(process_group PUBLIC ${ATEN_LIBRARIES})

add_library(process_group_gloo ProcessGroupGloo.cpp)


// it adds an asynchronous wait for the internal stream
// (cudaEventSynchronize). This way we retain the ability to write
// sequential code that executes asynchronously, without requiring the
// caller to perform explicit synchronization.

This comment was marked as off-topic.

This comment was marked as off-topic.

virtual bool wait() = 0;

// Returns exception if wait() returned false.
virtual const std::exception& exception() const = 0;
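
A sketch of how a caller might consume this interface (not code from the PR):

auto work = pg.allreduce(tensors);
if (!work->wait()) {
  // wait() returned false, so exception() describes the failure.
  const std::exception& ex = work->exception();
  std::cerr << "allreduce failed: " << ex.what() << std::endl;
}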



// Must not be copyable
AlgorithmEntry& operator=(const AlgorithmEntry&) = delete;
AlgorithmEntry(const AlgorithmEntry&) = delete;



void initialize();

void initialize(Options& options);


auto& srcSizes = key.srcSizes;
entry->src.resize(srcSizes.size());
for (int i = 0; i < srcSizes.size(); i++) {
entry->src[i] = at::zeros(*key.type, at::IntList(srcSizes[i]));


entry->src[i] = at::zeros(*key.type, at::IntList(srcSizes[i]));
}

return std::move(entry);


// Grab entry from the cache and return it.
auto entry = std::move(it->second);
cache_.erase(key);
return std::move(entry);


std::unique_lock<std::mutex> lock(m_);
queue_.push_back(std::make_tuple(std::move(entry), work));
queueProduceCV_.notify_one();
return std::move(work);


}

// Define how to run the algorithm and copy back results
entry->run = [tensors](EntryType& entry) mutable {


@pietern pietern force-pushed the c10d-process-group branch from d3bacfa to ad8a2c1 Compare May 22, 2018 20:09
Contributor Author

pietern commented May 22, 2018

Thanks for the comprehensive review @apaszke. I changed the cache behavior to no longer std::move a unique_ptr around, but to keep it in the cache and refer to it through a raw pointer. This way the workers only need to acquire a mutex on the queue. When they're done, they mark the entry as done and notify the CV associated with the entry. If in the meantime a new caller comes in and needs to use it, it will block until this done flag is set. This is about as minimal as it gets.
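
A sketch of what the worker side looks like under this scheme (names follow the snippets above; finish() is a hypothetical helper, not verbatim from the PR):

// Worker thread: m_ only protects the queue; entries stay in the cache and are
// referred to by raw pointer.
while (!stop_) {
  std::unique_lock<std::mutex> lock(m_);
  queueProduceCV_.wait(lock, [&] { return stop_ || !queue_.empty(); });
  if (stop_) {
    break;
  }
  auto tuple = std::move(queue_.front());
  queue_.pop_front();
  lock.unlock();

  auto* entry = std::get<0>(tuple);
  auto& work = std::get<1>(tuple);
  entry->run();            // run the algorithm and copy results back

  // Mark the entry as no longer busy and wake any caller waiting for it.
  {
    std::unique_lock<std::mutex> entryLock(entry->m);
    entry->busy = false;
  }
  entry->cv.notify_one();
  work->finish();          // hypothetical: mark the Work object completed
}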

Contributor Author

pietern commented May 22, 2018

@pytorchbot retest this please

@pietern pietern changed the title Process group base class and gloo implementation Process group base class and Gloo implementation May 22, 2018
)

add_library(c10d ${C10D_SRCS})
target_compile_options(c10d PUBLIC "-std=c++11")


const AllreduceOptions& opts = AllreduceOptions()) = 0;

protected:
const int rank_;


: ProcessGroup(rank, size), store_(new GlooStore(store)), stop_(false) {
auto& devices = options.devices;
if (devices.empty()) {
devices.push_back(::gloo::transport::tcp::CreateDevice("localhost"));


return construct(key);
if (it == cache_.end()) {
cache_[key] = construct(key);
it = cache_.find(key);


cache_.erase(key);
return entry;
// Mark entry in use
entry->busy = true;



// Define how to run the algorithm and copy back results
entry->run = [tensors](EntryType& entry) mutable {
entry->run = [=]() mutable {


Contributor

apaszke commented May 23, 2018

Whoops, I commented on the older commits, and GitHub is hiding all my comments for some reason. I think some of them still apply, so please take a look.

They're not really blocking in any way, but it would be good to document what each mutex protects, and our entry mutex strategy could have been simplified.
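
For example, something as small as this would help (illustrative only):

// m_ only serializes access to queue_; per-entry state is protected by the
// entry's own mutex and condition variable.
std::mutex m_;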

Contributor Author

pietern commented May 23, 2018

Thanks for the comments @apaszke. Adding a comment for the mutex and then merging.

Contributor Author

pietern commented May 23, 2018

Ehh, I had already removed it. It's good as is.

@pietern pietern merged commit ee5e474 into pytorch:master May 23, 2018
@pietern pietern deleted the c10d-process-group branch May 23, 2018 16:02
petrex pushed a commit to petrex/pytorch that referenced this pull request May 23, 2018
…e2_core_hip

* 'caffe2_core_hip' of github.com:petrex/pytorch: (24 commits)
  Allow empty storage for the 'Edge' class. (pytorch#7595)
  Process group base class and Gloo implementation (pytorch#7628)
  _LRSchedulers getstate include optimizer info (pytorch#7757)
  [PyTorch] [gradcheck] change backward() to grad() (pytorch#7710)
  Update test_nn.py (pytorch#7787)
  Define general default scheduler for TBB and fix ppc64le bug (pytorch#7761)
  Add support for accepting Tensor as input in clip_grad_*  functions. (pytorch#7769)
  [Easy] Remove unused code (pytorch#7782)
  Update tbb (pytorch#7734)
  Add @generated annotation (pytorch#7780)
  fix legacy comment after variable tensor merge (pytorch#7771)
  Revert pytorch#7750 and pytorch#7762 to fix Windows CI on master (pytorch#7772)
  Temporarily disable build env check (pytorch#7768)
  Add missing brace (pytorch#7762)
  [C++ API] Add backward() to Tensor and Variable  (pytorch#7750)
  [auto] Update onnx to d43b550 - Fix .gitignore and add missing files (onnx/onnx#1005) onnx/onnx@d43b550
  [auto] Update onnx to ea1aa13 - add tests for reduce ops (onnx/onnx#675) onnx/onnx@ea1aa13
  include cudnn_h (pytorch#7749)
  [C++ API] Using new registration mechanism (pytorch#7663)
  [auto] Update onnx to 5dd68e6 - Add a util function: polish_model (onnx/onnx#1000) onnx/onnx@5dd68e6
  ...
weiyangfb pushed a commit to weiyangfb/pytorch that referenced this pull request Jun 11, 2018