Conversation

@osalpekar
Member

@osalpekar osalpekar commented Jul 2, 2020

Stack from ghstack:

This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We then wait on completion of this allgather op with a timeout of 250ms, expecting the operation to time out and throw a runtime error.

Differential Revision: [D22173101](https://our.internmc.facebook.com/intern/diff/D22173101/)

[ghstack-poisoned]
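
For reference, the test flow reads roughly like this (a minimal sketch of the description above, not the exact code in this diff; the process-group setup and the mechanism that delays the allgather are test-fixture details and are assumed here):

// Kick off an allgather whose NCCL work is internally delayed by ~1000ms
// (via the test fixture), then wait on it with a 250ms timeout. wait()
// should throw a std::runtime_error before the collective completes.
auto work = pg->allgather(outputTensors, inputTensors);
try {
  work->wait(std::chrono::milliseconds(250)); // timeout shorter than the 1000ms delay
  throw std::runtime_error("test failure: wait() should have timed out");
} catch (const std::runtime_error& e) {
  // Expected: the work-level timeout fired before the allgather finished.
}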
@dr-ci

dr-ci bot commented Jul 2, 2020

💊 CI failures summary and remediations

As of commit 3e25ba8 (more details on the Dr. CI page):


  • 4/4 failures possibly* introduced in this PR
    • 1/4 non-CircleCI failure(s)

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test2 (1/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

AssertionError: expected a non-deterministic error, but it was not raised
  test_xavier_uniform_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.013s) 
 
====================================================================== 
FAIL [0.024s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN) 
---------------------------------------------------------------------- 
Traceback (most recent call last): 
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 674, in efail_fn_no_device 
    return efail_fn(slf, None, *args, **kwargs) 
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 665, in efail_fn 
    slf.fail('expected a non-deterministic error, but it was not raised') 
AssertionError: expected a non-deterministic error, but it was not raised 
 
---------------------------------------------------------------------- 
Ran 1800 tests in 690.609s 
 
FAILED (failures=1, skipped=90, expected failures=4) 
 
Generating XML reports... 
Generated XML report: test-reports\python-unittest\TEST-PackedSequenceTest-20200715200807.xml 
Generated XML report: test-reports\python-unittest\TEST-TestAddRelu-20200715200807.xml 
Generated XML report: test-reports\python-unittest\TEST-TestAvgPool-20200715200807.xml 

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test1 (2/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

AssertionError: expected a non-deterministic error, but it was not raised
  test_xavier_uniform_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.014s) 
 
====================================================================== 
FAIL [0.023s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN) 
---------------------------------------------------------------------- 
Traceback (most recent call last): 
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 674, in efail_fn_no_device 
    return efail_fn(slf, None, *args, **kwargs) 
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 665, in efail_fn 
    slf.fail('expected a non-deterministic error, but it was not raised') 
AssertionError: expected a non-deterministic error, but it was not raised 
 
---------------------------------------------------------------------- 
Ran 1800 tests in 698.240s 
 
FAILED (failures=1, skipped=90, expected failures=4) 
 
Generating XML reports... 
Generated XML report: test-reports\python-unittest\TEST-PackedSequenceTest-20200715194100.xml 
Generated XML report: test-reports\python-unittest\TEST-TestAddRelu-20200715194100.xml 
Generated XML report: test-reports\python-unittest\TEST-TestAvgPool-20200715194100.xml 

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (3/3)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 15 20:03:34 FAIL [0.036s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN)
Jul 15 20:03:32   test_sparse_default_std (__main__.TestNNInit) ... ok (0.007s) 
Jul 15 20:03:32   test_sparse_only_works_on_2d_inputs (__main__.TestNNInit) ... ok (0.001s) 
Jul 15 20:03:33   test_trunc_normal (__main__.TestNNInit) ... ok (0.705s) 
Jul 15 20:03:34   test_uniform (__main__.TestNNInit) ... ok (0.895s) 
Jul 15 20:03:34   test_xavier_normal (__main__.TestNNInit) ... ok (0.104s) 
Jul 15 20:03:34   test_xavier_normal_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.001s) 
Jul 15 20:03:34   test_xavier_uniform (__main__.TestNNInit) ... ok (0.078s) 
Jul 15 20:03:34   test_xavier_uniform_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.001s) 
Jul 15 20:03:34  
Jul 15 20:03:34 ====================================================================== 
Jul 15 20:03:34 FAIL [0.036s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN) 
Jul 15 20:03:34 ---------------------------------------------------------------------- 
Jul 15 20:03:34 Traceback (most recent call last): 
Jul 15 20:03:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper 
Jul 15 20:03:34     method(*args, **kwargs) 
Jul 15 20:03:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper 
Jul 15 20:03:34     method(*args, **kwargs) 
Jul 15 20:03:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 674, in efail_fn_no_device 
Jul 15 20:03:34     return efail_fn(slf, None, *args, **kwargs) 
Jul 15 20:03:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 665, in efail_fn 
Jul 15 20:03:34     slf.fail('expected a non-deterministic error, but it was not raised') 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 24 times.

osalpekar added 2 commits July 2, 2020 16:37
// First we reset the NCCL_BLOCKING_WAIT environment variable to prevent
// any unwanted side-effects. We must do this at all exit-points for this
// function.
setenv(c10d::NCCL_BLOCKING_WAIT, originalBlockingWait, 1);
Contributor

nit (please feel free to ignore): is it worth making this a guard?

Contributor

Good point. It's a bit of overhead to restore the original env value at every exit point.
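
For reference, such a guard could look roughly like this (a hypothetical sketch; EnvVarGuard is an invented name, not something in the codebase):

// Hypothetical RAII guard: saves the current value of an environment
// variable on construction and restores (or unsets) it on destruction,
// so every exit point of the test restores NCCL_BLOCKING_WAIT automatically.
#include <cstdlib>
#include <string>

class EnvVarGuard {
 public:
  EnvVarGuard(const char* name, const char* value) : name_(name) {
    const char* old = std::getenv(name);
    hadValue_ = (old != nullptr);
    if (hadValue_) {
      oldValue_ = old;
    }
    setenv(name, value, /*overwrite=*/1);
  }
  ~EnvVarGuard() {
    if (hadValue_) {
      setenv(name_, oldValue_.c_str(), /*overwrite=*/1);
    } else {
      unsetenv(name_);
    }
  }
 private:
  const char* name_;
  std::string oldValue_;
  bool hadValue_ = false;
};

With a guard like this, the test would set NCCL_BLOCKING_WAIT once at the top and the restore would happen automatically on every return or throw.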

setenv(c10d::NCCL_BLOCKING_WAIT, originalBlockingWait, 1);
// If no exception, that is also an error since we expect the test.wait call
// to timeout and throw.
throw std::runtime_error("BOOM!");
Contributor

Can we make the message more informative here? Same for the above two throws.

Member Author

I was actually planning on cleaning up these messages, but for now I just followed the same error messages as the existing tests (they all do throw std::runtime_error("BOOM!") for some reason...). I'm planning to put up another PR that fixes all of these.
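
For example, a more descriptive replacement might read (illustrative only; the follow-up PR may word these differently):

// Name the violated expectation instead of the opaque "BOOM!":
throw std::runtime_error(
    "Expected work.wait() to time out after 250ms and throw, but it completed");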

@facebook-github-bot
Contributor

This pull request has been merged in 01dcef2.

@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/51/head branch July 20, 2020 14:18