[NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking #95715
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95715
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 Failures as of commit 8b5c091: NEW FAILURES. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed 1ccaf67 to 81ce78a
): ...
@staticmethod
def _group_start() -> None: ...
@staticmethod
Removing static method as _group_end() might need to check the communicator map of the ProcessGroup to properly wait on collectives if nonblocking is used.
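A minimal sketch of the idea (not the actual ProcessGroupNCCL implementation; the `comms_in_flight` parameter is hypothetical): with nonblocking communicators, `ncclGroupEnd` can return `ncclInProgress`, so a non-static `_group_end` needs per-instance state to know which communicators to poll.

```cpp
#include <nccl.h>
#include <stdexcept>
#include <string>
#include <vector>

void groupEndAndWait(const std::vector<ncclComm_t>& comms_in_flight) {
  ncclResult_t result = ncclGroupEnd();
  if (result == ncclInProgress) {
    // The grouped operations are still being enqueued asynchronously; poll
    // each communicator until it leaves the in-progress state.
    for (ncclComm_t comm : comms_in_flight) {
      ncclResult_t state = ncclInProgress;
      do {
        ncclCommGetAsyncError(comm, &state);
      } while (state == ncclInProgress);
    }
  } else if (result != ncclSuccess) {
    throw std::runtime_error(
        std::string("NCCL error: ") + ncclGetErrorString(result));
  }
}
```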
torch/csrc/cuda/nccl.cpp
Outdated
if (!comm_nonblocking) {
  NCCL_CHECK(ncclCommCount(comm, &numranks));
} else {
  NCCL_CHECK_NONBLOCKING(ncclCommCount(comm, &numranks), _comm);
Might be unnecessary to also do a non-blocking check for ncclCommCount (unsure if there exists documentation on exactly which API calls might leave a communicator in an in-progress state).
Agree it is unnecessary. It is the user's responsibility to make sure comm is ready before accessing any attribute of it. (If comm is not ready, this call would actually error out rather than returning ncclInProgress.)
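As a hedged illustration of that point (the helper names below are hypothetical, not PyTorch or NCCL APIs): wait for the communicator to become ready once, and then attribute queries such as `ncclCommCount` only need the plain blocking-style check.

```cpp
#include <nccl.h>
#include <stdexcept>
#include <string>

// Poll until a nonblocking communicator has finished initializing (or failed).
void waitUntilCommReady(ncclComm_t comm) {
  ncclResult_t state = ncclInProgress;
  do {
    ncclCommGetAsyncError(comm, &state);
  } while (state == ncclInProgress);
}

int commCount(ncclComm_t comm) {
  waitUntilCommReady(comm);  // comm is now either ready or in a hard error state
  int numranks = 0;
  ncclResult_t res = ncclCommCount(comm, &numranks);
  if (res != ncclSuccess) {
    // A not-yet-ready or failed communicator errors out here; it does not
    // return ncclInProgress, so no nonblocking check is needed.
    throw std::runtime_error(
        std::string("NCCL error: ") + ncclGetErrorString(res));
  }
  return numranks;
}
```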
#if defined(NCCL_MAJOR) && (NCCL_MAJOR == 2) && defined(NCCL_MINOR) && \
    (NCCL_MINOR >= 14)
#define NCCL_HAS_COMM_NONBLOCKING
Not sure why this needs to be redefined here in order to work when a definition already exists in ProcessGroupNCCL.hpp.
Because NCCLUtils.hpp does not include ProcessGroupNCCL.hpp, and the two are not in the same compilation unit?
nit: a reminder for me to clean this up.
@pytorchmergebot rebase

@pytorchbot successfully started a rebase job. Check the current status here.

Successfully rebased
Force-pushed 84c038a to 82f26eb
#ifdef NCCL_HAS_COMM_NONBLOCKING
  ncclResult_t result = to_nccl_result(status);
  while (result == ncclInProgress) {
    ncclCommGetAsyncError(to_nccl_comm(comm), &result);
I wonder if to_nccl_comm(comm) is needed here.
Here is the definition of to_nccl_comm:
ncclComm_t to_nccl_comm(torch::cuda::nccl::ncclComm_t var) {
  return reinterpret_cast<ncclComm_t>(var);
}
It seems to me comm is already an ncclComm_t (the one defined by NCCL).
Side note: we should remove the duplicated ncclComm_t definition in torch::cuda::nccl; it is making things complicated. That is out of scope for this PR, so we can do it later.
torch/csrc/cuda/nccl.cpp
Outdated
static inline void NCCL_CHECK_NONBLOCKING(
    ncclResult_t result,
    ncclComm_t comm) {
It seems to me this should be the base case. I could be wrong though :)
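For reference, a hedged sketch of what such a single-communicator base case could look like, in the spirit of the `NCCL_CHECK_NONBLOCKING` helper quoted above (illustrative only, not the merged code): if the call reported `ncclInProgress`, poll the communicator's async state until it settles, then treat any remaining non-success result as a hard error.

```cpp
#include <nccl.h>
#include <stdexcept>
#include <string>

static inline void checkNonblocking(ncclResult_t result, ncclComm_t comm) {
  if (result == ncclInProgress) {
    // Poll the communicator until its asynchronous work is no longer pending.
    do {
      ncclCommGetAsyncError(comm, &result);
    } while (result == ncclInProgress);
  }
  if (result != ncclSuccess) {
    throw std::runtime_error(
        std::string("NCCL error: ") + ncclGetErrorString(result));
  }
}
```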
for (const auto i : c10::irange(comms.size())) {
  do {
    ncclCommGetAsyncError(to_nccl_comm(comms[i]), &result);
  } while (result == ncclInProgress);
I wonder which one should be the inner loop.
Could one comm be hanging while another has already errored out, in which case we would miss catching the error here?
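One possible answer to that question, sketched below (illustrative only, not the merged code): poll every communicator once per outer iteration, so an error on any communicator is observed even while another one is still, possibly indefinitely, in progress.

```cpp
#include <nccl.h>
#include <vector>

// Poll all communicators round-robin: return the first hard error seen, and
// keep looping only while at least one communicator is still in progress.
ncclResult_t pollAllComms(const std::vector<ncclComm_t>& comms) {
  bool anyInProgress = true;
  while (anyInProgress) {
    anyInProgress = false;
    for (ncclComm_t comm : comms) {
      ncclResult_t state = ncclSuccess;
      ncclCommGetAsyncError(comm, &state);
      if (state == ncclInProgress) {
        anyInProgress = true;   // keep polling this communicator
      } else if (state != ncclSuccess) {
        return state;           // surface the error even if other comms hang
      }
    }
  }
  return ncclSuccess;
}
```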
@pytorchmergebot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 3 jobs have failed; the first few of them are: periodic / cuda11.7-py3.10-gcc7-sm86-periodic-dynamo-benchmarks / test (aot_eager_huggingface, 1, 1, linux.g5.4xlarge.nvidia.gpu), periodic / cuda11.7-py3.10-gcc7-sm86-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 1, 1, linux.g5.4xlarge.nvidia.gpu), inductor / cuda11.8-py3.10-gcc7-sm86 / test (inductor_torchbench_dynamic, 1, 1, linux.g5.4xlarge.nvidia.gpu). Details for Dev Infra team: raised by workflow job.

@pytorchmergebot -f "assume inductor failures unrelated"

❌ 🤖 pytorchbot command failed: Try

@pytorchmergebot merge -f "assume inductor failures unrelated"

Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
#95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior. However, calling `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig`, since the `_abort` method only looked through the `devNCCLCommMap_` map and aborted those communicators. Because `ncclCommInitRankConfig` was stuck, the communicator itself was never added to the map, and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.

To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once added to `devNCCLCommMap_`. I also added a unit test that was failing without the changes to ProcessGroupNCCL.cpp.

Pull Request resolved: #103264
Approved by: https://github.com/kwen2501
…3925)

Pull Request resolved: #103925
Approved by: https://github.com/osalpekar
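A condensed sketch of the fix described above (the real ProcessGroupNCCL code wraps communicators in NCCLComm objects and takes locks; the member and key types here are simplified for illustration):

```cpp
#include <nccl.h>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct ProcessGroupSketch {
  std::unordered_map<std::string, std::vector<ncclComm_t>> inProgressCommMap_;
  std::unordered_map<std::string, std::vector<ncclComm_t>> devNCCLCommMap_;

  void registerComm(const std::string& key, ncclComm_t comm) {
    // Track the communicator immediately after ncclCommInitRankConfig, so an
    // abort can reach it even if initialization never completes.
    inProgressCommMap_[key].push_back(comm);
  }

  void markReady(const std::string& key) {
    // Once initialization finishes, move the communicators into the regular
    // map and stop tracking them as in-progress.
    devNCCLCommMap_[key] = std::move(inProgressCommMap_[key]);
    inProgressCommMap_.erase(key);
  }

  void abortAll() {
    // _abort must cover both maps; aborting in-progress communicators is what
    // unblocks a host thread stuck inside a nonblocking init.
    for (auto& kv : inProgressCommMap_)
      for (ncclComm_t c : kv.second) ncclCommAbort(c);
    for (auto& kv : devNCCLCommMap_)
      for (ncclComm_t c : kv.second) ncclCommAbort(c);
  }
};
```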
Support for nonblocking NCCL communicators / fault tolerance / checking, which NCCL 2.14 added as an experimental feature.
Enabled via the environment variable:
CC @ptrblck
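For context, a minimal sketch of the NCCL 2.14+ nonblocking-init mechanism this PR builds on (illustrative only; the function name below is hypothetical, and the PR itself wires this up through ProcessGroupNCCL):

```cpp
#include <nccl.h>

ncclComm_t initNonblockingComm(int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  // request a nonblocking communicator (NCCL >= 2.14)
  ncclComm_t comm = nullptr;
  ncclResult_t res = ncclCommInitRankConfig(&comm, nranks, id, rank, &config);
  // With blocking == 0, res may be ncclInProgress: initialization continues
  // asynchronously, and the host can poll ncclCommGetAsyncError(comm, ...)
  // or call ncclCommAbort(comm) instead of being stuck inside the init call.
  (void)res;
  return comm;
}
```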