Conversation

@osalpekar
Member

@osalpekar osalpekar commented Jul 2, 2020

Stack from ghstack:

This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We then wait on completion of this allgather op with a timeout of 250ms, expecting the operation to time out and throw a runtime error.

Differential Revision: [D22173101](https://our.internmc.facebook.com/intern/diff/D22173101/)

[ghstack-poisoned]
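
For reference, the test flow reads roughly like this (a minimal sketch of the description above, not the exact code in this diff; the process-group setup and the mechanism that delays the allgather are test-fixture details and are assumed here):

// Kick off an allgather whose NCCL work is internally delayed by ~1000ms
// (via the test fixture), then wait on it with a 250ms timeout. wait()
// should throw a std::runtime_error before the collective completes.
auto work = pg->allgather(outputTensors, inputTensors);
try {
  work->wait(std::chrono::milliseconds(250)); // timeout shorter than the 1000ms delay
  throw std::runtime_error("test failure: wait() should have timed out");
} catch (const std::runtime_error& e) {
  // Expected: the work-level timeout fired before the allgather finished.
}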
@dr-ci

dr-ci bot commented Jul 2, 2020

💊 CI failures summary and remediations

As of commit 3e25ba8 (more details on the Dr. CI page):


  • 4/4 failures possibly* introduced in this PR
    • 1/4 non-CircleCI failure(s)

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test2 (1/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

AssertionError: expected a non-deterministic error, but it was not raised
  test_xavier_uniform_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.013s) 
 
====================================================================== 
FAIL [0.024s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN) 
---------------------------------------------------------------------- 
Traceback (most recent call last): 
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 674, in efail_fn_no_device 
    return efail_fn(slf, None, *args, **kwargs) 
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 665, in efail_fn 
    slf.fail('expected a non-deterministic error, but it was not raised') 
AssertionError: expected a non-deterministic error, but it was not raised 
 
---------------------------------------------------------------------- 
Ran 1800 tests in 690.609s 
 
FAILED (failures=1, skipped=90, expected failures=4) 
 
Generating XML reports... 
Generated XML report: test-reports\python-unittest\TEST-PackedSequenceTest-20200715200807.xml 
Generated XML report: test-reports\python-unittest\TEST-TestAddRelu-20200715200807.xml 
Generated XML report: test-reports\python-unittest\TEST-TestAvgPool-20200715200807.xml 

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test1 (2/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

AssertionError: expected a non-deterministic error, but it was not raised
  test_xavier_uniform_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.014s) 
 
====================================================================== 
FAIL [0.023s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN) 
---------------------------------------------------------------------- 
Traceback (most recent call last): 
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 674, in efail_fn_no_device 
    return efail_fn(slf, None, *args, **kwargs) 
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 665, in efail_fn 
    slf.fail('expected a non-deterministic error, but it was not raised') 
AssertionError: expected a non-deterministic error, but it was not raised 
 
---------------------------------------------------------------------- 
Ran 1800 tests in 698.240s 
 
FAILED (failures=1, skipped=90, expected failures=4) 
 
Generating XML reports... 
Generated XML report: test-reports\python-unittest\TEST-PackedSequenceTest-20200715194100.xml 
Generated XML report: test-reports\python-unittest\TEST-TestAddRelu-20200715194100.xml 
Generated XML report: test-reports\python-unittest\TEST-TestAvgPool-20200715194100.xml 

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (3/3)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 15 20:03:34 FAIL [0.036s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN)
Jul 15 20:03:32   test_sparse_default_std (__main__.TestNNInit) ... ok (0.007s) 
Jul 15 20:03:32   test_sparse_only_works_on_2d_inputs (__main__.TestNNInit) ... ok (0.001s) 
Jul 15 20:03:33   test_trunc_normal (__main__.TestNNInit) ... ok (0.705s) 
Jul 15 20:03:34   test_uniform (__main__.TestNNInit) ... ok (0.895s) 
Jul 15 20:03:34   test_xavier_normal (__main__.TestNNInit) ... ok (0.104s) 
Jul 15 20:03:34   test_xavier_normal_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.001s) 
Jul 15 20:03:34   test_xavier_uniform (__main__.TestNNInit) ... ok (0.078s) 
Jul 15 20:03:34   test_xavier_uniform_errors_on_inputs_smaller_than_2d (__main__.TestNNInit) ... ok (0.001s) 
Jul 15 20:03:34  
Jul 15 20:03:34 ====================================================================== 
Jul 15 20:03:34 FAIL [0.036s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN) 
Jul 15 20:03:34 ---------------------------------------------------------------------- 
Jul 15 20:03:34 Traceback (most recent call last): 
Jul 15 20:03:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper 
Jul 15 20:03:34     method(*args, **kwargs) 
Jul 15 20:03:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper 
Jul 15 20:03:34     method(*args, **kwargs) 
Jul 15 20:03:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 674, in efail_fn_no_device 
Jul 15 20:03:34     return efail_fn(slf, None, *args, **kwargs) 
Jul 15 20:03:34   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 665, in efail_fn 
Jul 15 20:03:34     slf.fail('expected a non-deterministic error, but it was not raised') 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 24 times.

osalpekar added 2 commits July 2, 2020 16:37
// First we reset the NCCL_BLOCKING_WAIT environment variable to prevent
// any unwanted side-effects. We must do this at all exit-points for this
// function.
setenv(c10d::NCCL_BLOCKING_WAIT, originalBlockingWait, 1);
Contributor

nit (please feel free to ignore): is it worth making this a guard?

Contributor

Good point. It's a bit of overhead to restore the original env value at every exit point.
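
For reference, such a guard could look roughly like this (a hypothetical sketch; EnvVarGuard is an invented name, not something in the codebase):

// Hypothetical RAII guard: saves the current value of an environment
// variable on construction and restores (or unsets) it on destruction,
// so every exit point of the test restores NCCL_BLOCKING_WAIT automatically.
#include <cstdlib>
#include <string>

class EnvVarGuard {
 public:
  EnvVarGuard(const char* name, const char* value) : name_(name) {
    const char* old = std::getenv(name);
    hadValue_ = (old != nullptr);
    if (hadValue_) {
      oldValue_ = old;
    }
    setenv(name, value, /*overwrite=*/1);
  }
  ~EnvVarGuard() {
    if (hadValue_) {
      setenv(name_, oldValue_.c_str(), /*overwrite=*/1);
    } else {
      unsetenv(name_);
    }
  }
 private:
  const char* name_;
  std::string oldValue_;
  bool hadValue_ = false;
};

With a guard like this, the test would set NCCL_BLOCKING_WAIT once at the top and the restore would happen automatically on every return or throw.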

setenv(c10d::NCCL_BLOCKING_WAIT, originalBlockingWait, 1);
// If no exception, that is also an error since we expect the test.wait call
// to timeout and throw.
throw std::runtime_error("BOOM!");
Contributor

Can we make the message more informative here? Same for the above two throws.

Member Author

I was actually planning on cleaning up these messages, but for now I just followed the same error messages as the existing tests (they all do throw std::runtime_error("BOOM!") for some reason...). I'm planning to put up another PR that fixes all of these.
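
For example, a more descriptive replacement might read (illustrative only; the follow-up PR may word these differently):

// Name the violated expectation instead of the opaque "BOOM!":
throw std::runtime_error(
    "Expected work.wait() to time out after 250ms and throw, but it completed");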

@facebook-github-bot
Contributor

This pull request has been merged in 01dcef2.

@facebook-github-bot facebook-github-bot deleted the gh/osalpekar/51/head branch July 20, 2020 14:18