
Conversation

Contributor

@pietern pietern commented May 16, 2018

This is a starting point and only implements allreduce for CPU tensors. It includes most of the base functionality, like algorithm caching (a similar approach to the one taken in the THD GlooCache) and multi-threaded execution (new).

The expectation is that function calls on the process group class are globally serialized. They execute collective functions, so members of the collective must call the same functions in the same order, or a deadlock may happen.
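
For illustration, a minimal sketch of what this means in practice (the c10d namespace, the step() wrapper, and the Work-returning allreduce() are assumed here, not quoted from the diff):

// Every member of the group must issue the same collectives in the same order.
void step(c10d::ProcessGroup& pg,
          std::vector<at::Tensor>& grads,
          std::vector<at::Tensor>& stats) {
  auto work1 = pg.allreduce(grads);   // call 1 on every rank
  auto work2 = pg.allreduce(stats);   // call 2 on every rank
  work1->wait();
  work2->wait();
  // Reordering or skipping one of these calls on a single rank may deadlock.
}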

The algorithm cache works as follows: the ProcessGroupGloo class has a cache map from algorithm keys to algorithm entries. The algorithm key is a struct with fields that make up the signature of a collective function. It includes the dimensionality of the input/output tensors, tensor device assignment, source/destination rank, etc. For collective calls with the same key, the process group will lazily initialize and then cache a Gloo algorithm instance. For now we only keep a single algorithm instance per key, but this may be revisited in the future, if we observe contention on a single key and can exploit additional parallelism.
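
A rough sketch of that layout (field and type names here are illustrative, not necessarily the ones used in the diff):

// Signature of a collective call; calls with equal keys may share a cached algorithm.
struct AlgorithmKey {
  CollectiveType collectiveType;               // which collective (e.g. allreduce)
  at::Type* type = nullptr;                    // tensor type of the inputs
  std::vector<std::vector<int64_t>> srcSizes;  // sizes of the input tensors
  std::vector<int> srcDevices;                 // device assignment
  int srcRank = -1;                            // source/destination rank, if any
  bool operator==(const AlgorithmKey& other) const;
};

// One lazily initialized entry per key, owned by ProcessGroupGloo.
// AlgorithmKeyHash is a hash functor for the key (assumed here).
std::unordered_map<AlgorithmKey, std::unique_ptr<AlgorithmEntry>, AlgorithmKeyHash> cache_;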

@pietern pietern added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 16, 2018
@pietern pietern changed the title [c10d] Process group base class and gloo implementation [c10d][wip] Process group base class and gloo implementation May 16, 2018
@pietern pietern requested a review from teng-li May 16, 2018 21:42
Contributor

@teng-li teng-li left a comment


@pietern Nice work!

I haven't looked through everything yet, and understand it's a WIP. I will use the base class to get other backends started while you continue working on this. Some comments while I was reading through the code (not everything yet).

Will review later when you finish the WIP


pietern added 7 commits May 18, 2018 11:11
This is copied from THD's DataChannel. I figured that instead of having the
Python side call it a process group and the C++ side call it a data channel,
we can use the Python side's name for both.

This does not yet include all collective ops that will be supported and
serves just as a starting point.
This is a starting point and only implements allreduce for CPU
tensors. It includes most base functionality like algorithm
caching (similar approach as taken in the THD GlooCache) and
multi-threaded execution (new).

The expectation is that function calls on the process group class are
globally serialized. They execute collective functions, so members of
the collective must call the same functions in the same order, or a
deadlock may happen.

TODO(pietern): Describe caching behavior
@pietern pietern force-pushed the c10d-process-group branch from 89df263 to c511c54 Compare May 18, 2018 18:23
Contributor Author

pietern commented May 18, 2018

Rebased

@pietern pietern changed the title [c10d][wip] Process group base class and gloo implementation Process group base class and gloo implementation May 18, 2018
Contributor

@apaszke apaszke left a comment


Mostly LGTM.

The locking/CV story in ProcessGroupGloo is quite complicated (especially around the cache). It seems like we could reduce contention on the global lock, and simplify things a bit, if we kept a std::mutex as part of each cache entry: the algorithms would effectively never leave the cache, and you'd be blocked for exactly as long as you need to be.
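
Something along these lines (a sketch of the suggestion, not code from this PR; lookupOrCreate() is a placeholder for the cache lookup):

struct AlgorithmEntry {
  std::mutex m;                  // protects only this entry
  std::condition_variable cv;    // signalled when the entry becomes free
  bool busy = false;
  std::unique_ptr<::gloo::Algorithm> algorithm;
  // ... cached src tensors, run function, etc.
};

// Callers block on the entry they need instead of a process-group-wide lock.
AlgorithmEntry* entry = lookupOrCreate(key);
{
  std::unique_lock<std::mutex> lock(entry->m);
  entry->cv.wait(lock, [&] { return !entry->busy; });
  entry->busy = true;
}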

target_include_directories(process_group PUBLIC ${ATEN_INCLUDE_DIR})
target_link_libraries(process_group PUBLIC ${ATEN_LIBRARIES})

add_library(process_group_gloo ProcessGroupGloo.cpp)


// it adds an asynchronous wait for the internal stream
// (cudaEventSynchronize). This way we retain the ability to write
// sequential code that executes asynchronously, without requiring the
// caller to perform explicit synchronization.

This comment was marked as off-topic.

This comment was marked as off-topic.

virtual bool wait() = 0;

// Returns exception if wait() returned false.
virtual const std::exception& exception() const = 0;
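
A sketch of how a caller might consume this interface (not code from the PR):

auto work = pg.allreduce(tensors);
if (!work->wait()) {
  // wait() returned false, so exception() describes the failure.
  const std::exception& ex = work->exception();
  std::cerr << "allreduce failed: " << ex.what() << std::endl;
}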



// Must not be copyable
AlgorithmEntry& operator=(const AlgorithmEntry&) = delete;
AlgorithmEntry(const AlgorithmEntry&) = delete;



void initialize();

void initialize(Options& options);


auto& srcSizes = key.srcSizes;
entry->src.resize(srcSizes.size());
for (int i = 0; i < srcSizes.size(); i++) {
entry->src[i] = at::zeros(*key.type, at::IntList(srcSizes[i]));


entry->src[i] = at::zeros(*key.type, at::IntList(srcSizes[i]));
}

return std::move(entry);


// Grab entry from the cache and return it.
auto entry = std::move(it->second);
cache_.erase(key);
return std::move(entry);


std::unique_lock<std::mutex> lock(m_);
queue_.push_back(std::make_tuple(std::move(entry), work));
queueProduceCV_.notify_one();
return std::move(work);


}

// Define how to run the algorithm and copy back results
entry->run = [tensors](EntryType& entry) mutable {


@pietern pietern force-pushed the c10d-process-group branch from d3bacfa to ad8a2c1 Compare May 22, 2018 20:09
Contributor Author

pietern commented May 22, 2018

Thanks for the comprehensive review @apaszke. I changed the cache behavior to no longer std::move a unique_ptr around, but to keep it in the cache and refer to it through a raw pointer. This way the workers only need to acquire a mutex on the queue. When they're done, they mark the entry as done and notify the CV associated with the entry. If in the meantime a new caller comes in and needs to use it, it will block until this done flag is set. This is about as minimal as it gets.
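
A sketch of what the worker side looks like under this scheme (names follow the snippets above; finish() is a hypothetical helper, not verbatim from the PR):

// Worker thread: m_ only protects the queue; entries stay in the cache and are
// referred to by raw pointer.
while (!stop_) {
  std::unique_lock<std::mutex> lock(m_);
  queueProduceCV_.wait(lock, [&] { return stop_ || !queue_.empty(); });
  if (stop_) {
    break;
  }
  auto tuple = std::move(queue_.front());
  queue_.pop_front();
  lock.unlock();

  auto* entry = std::get<0>(tuple);
  auto& work = std::get<1>(tuple);
  entry->run();            // run the algorithm and copy results back

  // Mark the entry as no longer busy and wake any caller waiting for it.
  {
    std::unique_lock<std::mutex> entryLock(entry->m);
    entry->busy = false;
  }
  entry->cv.notify_one();
  work->finish();          // hypothetical: mark the Work object completed
}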

Contributor Author

pietern commented May 22, 2018

@pytorchbot retest this please

@pietern pietern changed the title Process group base class and gloo implementation Process group base class and Gloo implementation May 22, 2018
)

add_library(c10d ${C10D_SRCS})
target_compile_options(c10d PUBLIC "-std=c++11")


const AllreduceOptions& opts = AllreduceOptions()) = 0;

protected:
const int rank_;


: ProcessGroup(rank, size), store_(new GlooStore(store)), stop_(false) {
auto& devices = options.devices;
if (devices.empty()) {
devices.push_back(::gloo::transport::tcp::CreateDevice("localhost"));


return construct(key);
if (it == cache_.end()) {
cache_[key] = construct(key);
it = cache_.find(key);


cache_.erase(key);
return entry;
// Mark entry in use
entry->busy = true;



// Define how to run the algorithm and copy back results
entry->run = [tensors](EntryType& entry) mutable {
entry->run = [=]() mutable {


Contributor

apaszke commented May 23, 2018

Whoops, I commented on the older commits, and GitHub is hiding all my comments for some reason. I think some of them still apply, so please take a look.

They're not really blocking in any way, but it would be good to document what each mutex protects, and our entry mutex strategy could have been simplified.
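
For example, something as small as this would help (illustrative only):

// m_ only serializes access to queue_; per-entry state is protected by the
// entry's own mutex and condition variable.
std::mutex m_;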

Contributor Author

pietern commented May 23, 2018

Thanks for the comments @apaszke. Adding a comment for the mutex and then merging.

Contributor Author

pietern commented May 23, 2018

Ehh, I had already removed it. It's good as is.

@pietern pietern merged commit ee5e474 into pytorch:master May 23, 2018
@pietern pietern deleted the c10d-process-group branch May 23, 2018 16:02
petrex pushed a commit to petrex/pytorch that referenced this pull request May 23, 2018
…e2_core_hip

* 'caffe2_core_hip' of github.com:petrex/pytorch: (24 commits)
  Allow empty storage for the 'Edge' class. (pytorch#7595)
  Process group base class and Gloo implementation (pytorch#7628)
  _LRSchedulers getstate include optimizer info (pytorch#7757)
  [PyTorch] [gradcheck] change backward() to grad() (pytorch#7710)
  Update test_nn.py (pytorch#7787)
  Define general default scheduler for TBB and fix ppc64le bug (pytorch#7761)
  Add support for accepting Tensor as input in clip_grad_*  functions. (pytorch#7769)
  [Easy] Remove unused code (pytorch#7782)
  Update tbb (pytorch#7734)
  Add @generated annotation (pytorch#7780)
  fix legacy comment after variable tensor merge (pytorch#7771)
  Revert pytorch#7750 and pytorch#7762 to fix Windows CI on master (pytorch#7772)
  Temporarily disable build env check (pytorch#7768)
  Add missing brace (pytorch#7762)
  [C++ API] Add backward() to Tensor and Variable  (pytorch#7750)
  [auto] Update onnx to d43b550 - Fix .gitignore and add missing files (onnx/onnx#1005) onnx/onnx@d43b550
  [auto] Update onnx to ea1aa13 - add tests for reduce ops (onnx/onnx#675) onnx/onnx@ea1aa13
  include cudnn_h (pytorch#7749)
  [C++ API] Using new registration mechanism (pytorch#7663)
  [auto] Update onnx to 5dd68e6 - Add a util function: polish_model (onnx/onnx#1000) onnx/onnx@5dd68e6
  ...
weiyangfb pushed a commit to weiyangfb/pytorch that referenced this pull request Jun 11, 2018