
Conversation

@mattip
Contributor

@mattip mattip commented Aug 6, 2020

Fixes #42153

As documented (search for curand_uniform on the page), curand_uniform returns values "from 0.0 to 1.0, where 1.0 is included and 0.0 is excluded." These endpoints are different from the CPU equivalent, which makes the calculation in this PR fail when the value is exactly 1.0.

The test from the issue is added; it failed for me consistently before this PR, even though I cut the number of samples by a factor of 10.
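
For reference, here is a rough sketch of the kind of check the issue describes; the counts, probabilities, and iteration numbers below are illustrative rather than the values used in the added test, and a CUDA device is required since only the CUDA sampler is affected:

    # Hedged sketch of the reported failure mode, not the exact test added in
    # this PR: sample a CUDA Binomial repeatedly and verify every sample stays
    # inside the valid range [0, total_count].
    import torch

    total_count = 10
    probs = torch.full((100_000,), 0.5, device="cuda")
    dist = torch.distributions.Binomial(total_count, probs)

    for _ in range(100):
        samples = dist.sample()
        # Before the fix, a curand_uniform() result of exactly 1.0 could push
        # a sample out of range (e.g. to -1).
        assert samples.min() >= 0
        assert samples.max() <= total_count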

@dr-ci

dr-ci bot commented Aug 6, 2020

💊 CI failures summary and remediations

As of commit e30b771 (more details on the Dr. CI page):


  • 5/5 failures possibly* introduced in this PR
    • 1/5 non-CircleCI failure(s)

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build caffe2_onnx_main_py3_6_clang7_ubuntu16_04_test (1/4)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Aug 13 18:27:21 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311
Aug 13 18:27:18 The directory '/var/lib/jenkins/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. 
Aug 13 18:27:18 Collecting pip 
Aug 13 18:27:19   Downloading https://files.pythonhosted.org/packages/5a/4a/39400ff9b36e719bdf8f31c99fe1fa7842a42fa77432e584f707a5080063/pip-20.2.2-py2.py3-none-any.whl (1.5MB) 
Aug 13 18:27:19 Installing collected packages: pip 
Aug 13 18:27:19   Found existing installation: pip 9.0.1 
Aug 13 18:27:19     Uninstalling pip-9.0.1: 
Aug 13 18:27:19       Successfully uninstalled pip-9.0.1 
Aug 13 18:27:20 Successfully installed pip-20.2.2 
Aug 13 18:27:20 + pip install -q --user -i https://test.pypi.org/simple/ ort-nightly==1.4.0.dev202007311 
Aug 13 18:27:21 ERROR: Could not find a version that satisfies the requirement ort-nightly==1.4.0.dev202007311 (from versions: 1.4.0.dev202007271, 1.4.0.dev202008122) 
Aug 13 18:27:21 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311 

See CircleCI build pytorch_macos_10_13_py3_test (2/4)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Aug 13 12:11:52 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Aug 13 12:11:52 At: 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Aug 13 12:11:52  
Aug 13 12:11:52 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Aug 13 12:11:52  
Aug 13 12:11:52 At: 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Aug 13 12:11:52  
Aug 13 12:11:52 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Aug 13 12:11:52  
Aug 13 12:11:52 At: 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Aug 13 12:11:52  
Aug 13 12:11:52 ok (1.250s) 
Aug 13 12:11:53   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.272s) 
Aug 13 12:11:55   test_return_local_rrefs (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.253s) 
Aug 13 12:11:56   test_rpc_return_rref (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.368s) 
Aug 13 12:12:04   test_rpc_timeouts (__main__.ProcessGroupRpcTestWithSpawn) ... ok (7.980s) 

See CircleCI build caffe2_onnx_ort2_py3_6_clang7_ubuntu16_04_test (3/4)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Aug 13 18:27:25 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311
Aug 13 18:27:22 The directory '/var/lib/jenkins/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. 
Aug 13 18:27:22 Collecting pip 
Aug 13 18:27:22   Downloading https://files.pythonhosted.org/packages/5a/4a/39400ff9b36e719bdf8f31c99fe1fa7842a42fa77432e584f707a5080063/pip-20.2.2-py2.py3-none-any.whl (1.5MB) 
Aug 13 18:27:23 Installing collected packages: pip 
Aug 13 18:27:23   Found existing installation: pip 9.0.1 
Aug 13 18:27:23     Uninstalling pip-9.0.1: 
Aug 13 18:27:23       Successfully uninstalled pip-9.0.1 
Aug 13 18:27:24 Successfully installed pip-20.2.2 
Aug 13 18:27:24 + pip install -q --user -i https://test.pypi.org/simple/ ort-nightly==1.4.0.dev202007311 
Aug 13 18:27:25 ERROR: Could not find a version that satisfies the requirement ort-nightly==1.4.0.dev202007311 (from versions: 1.4.0.dev202007271, 1.4.0.dev202008122) 
Aug 13 18:27:25 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311 

See CircleCI build caffe2_onnx_ort1_py3_6_clang7_ubuntu16_04_test (4/4)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Aug 13 18:27:19 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311
Aug 13 18:27:16 The directory '/var/lib/jenkins/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. 
Aug 13 18:27:16 Collecting pip 
Aug 13 18:27:17   Downloading https://files.pythonhosted.org/packages/5a/4a/39400ff9b36e719bdf8f31c99fe1fa7842a42fa77432e584f707a5080063/pip-20.2.2-py2.py3-none-any.whl (1.5MB) 
Aug 13 18:27:17 Installing collected packages: pip 
Aug 13 18:27:17   Found existing installation: pip 9.0.1 
Aug 13 18:27:17     Uninstalling pip-9.0.1: 
Aug 13 18:27:17       Successfully uninstalled pip-9.0.1 
Aug 13 18:27:18 Successfully installed pip-20.2.2 
Aug 13 18:27:18 + pip install -q --user -i https://test.pypi.org/simple/ ort-nightly==1.4.0.dev202007311 
Aug 13 18:27:19 ERROR: Could not find a version that satisfies the requirement ort-nightly==1.4.0.dev202007311 (from versions: 1.4.0.dev202007271, 1.4.0.dev202008122) 
Aug 13 18:27:19 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311 

ci.pytorch.org: 1 failed



@mattip mattip changed the title from "Clip Binomial results for numerical instability" to "Clip Binomial results for different endpoints in curand_uniform" on Aug 6, 2020
@ailzhang ailzhang requested a review from ngimel August 7, 2020 01:42
@ailzhang ailzhang added the "triaged" label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Aug 7, 2020
@ailzhang
Contributor

ailzhang commented Aug 7, 2020

cc @ngimel for a review since you are in the original issue thread; please feel free to remove the assignment if it doesn't fit.

@mattip
Contributor Author

mattip commented Aug 7, 2020

Lint error is due to CMake not finding Protobuf:

CMake Error at /usr/local/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:164 (message):
  Could NOT find Protobuf: Found unsuitable version
  "Protobuf_VERSION_NOTFOUND", but required is at least "3" (found
  Protobuf_LIBRARY-NOTFOUND;-pthread)

The caffe2_onnx_main_py3_6_clang7_ubuntu16_04_build failure is caused by a CMake version discrepancy:

CMake Error at third_party/tensorpipe/CMakeLists.txt:12 (cmake_minimum_required):
Aug 06 22:05:47   CMake 3.17 or higher is required.  You are running version 3.5.1

I don't think I caused those ...

@mattip
Contributor Author

mattip commented Aug 7, 2020

I don't think I caused those ...

Yes, I did. I picked up some third-party submodule changes by mistake. Fixing.

@mattip
Contributor Author

mattip commented Aug 10, 2020

A few of the builds got this error; should I rebase off master again?

fatal: reference is not a tree: f015d698006c4a11be15b1ebb75b3b9bb317b914
Unable to checkout 'f015d698006c4a11be15b1ebb75b3b9bb317b914' in submodule \
    path 'third_party/tensorpipe'

Collaborator

This won't work, because val is float, and you are static_cast-ing it (not reinterpret_cast-ing it) to uint64_t, which means that it will become either 0 or 1.

Suggested change
-auto val = curand_uniform(&state);
+uint val = curand(&state); // need just the bits
+// MASK should be uint; int64 arithmetic is much slower. Honestly, even
+// static_cast<uint64_t>(1) is not needed, because digits for float is 24, so
+// 32 bits is enough, but it does not matter.
+constexpr auto MASK = static_cast<uint>((static_cast<uint64_t>(1) << std::numeric_limits<float>::digits) - 1);
+// uint64_t is not needed here either.
+constexpr auto DIVISOR = static_cast<float>(1) / (static_cast<uint>(1) << std::numeric_limits<float>::digits);
+return (val & MASK) * DIVISOR;
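
For intuition, the mask-and-scale trick can be sanity-checked outside of CUDA with ordinary random integers; a rough sketch (not part of the suggestion above, and the sample size is arbitrary):

    # Hedged sketch: masking random bits down to float's 24 mantissa digits and
    # scaling by 2**-24 yields values in [0, 1) -- 0.0 is reachable, 1.0 never is.
    import numpy as np

    rng = np.random.default_rng(0)
    bits = rng.integers(0, 2**32, size=1_000_000, dtype=np.uint64)
    MASK = (1 << 24) - 1                  # keep 24 bits, like std::numeric_limits<float>::digits
    DIVISOR = np.float32(1.0) / (1 << 24)
    vals = (bits & MASK).astype(np.float32) * DIVISOR

    assert vals.min() >= 0.0
    assert vals.max() < 1.0               # largest possible value is (2**24 - 1) / 2**24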

Contributor Author

thanks. I assumed tests would catch any gross errors :(

Contributor Author

adopted in 7257949. Also rebased off master to hopefully clear the CI errors

Collaborator

It's actually concerning that the tests don't catch those errors.

Collaborator

It should probably be uint32_t, to fix Windows builds, sorry.

@mattip
Contributor Author

mattip commented Aug 13, 2020

The pr/pytorch-linux-xenial-rocm3.5.1-py3.6 build is failing test_sum_fp16 (__main__.TestCuda) and test_trilu_indices (__main__.TestCuda). I can't see a connection to my changes:

  • this is cuda code, my changes are CPU only
  • there should be no connection to binomial sampling in those tests

@ngimel
Collaborator

ngimel commented Aug 13, 2020

Your changes are CUDA :-) but it does not look like they are related. ROCm has been very flaky lately. Let's try to merge.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Collaborator

ngimel commented Aug 13, 2020

Oh, btw, sorry to ask, but in the test you are doing, can you also roughly check the number of 0's and 1's, so that at least glaring errors in the distribution are caught?

@mattip
Contributor Author

mattip commented Aug 13, 2020

in the test you are doing, can you also roughly check the number of 0's and 1's

Yup.
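
Roughly this kind of check (a sketch only; the probability, sample count, and tolerance below are placeholders rather than the exact test that was added):

    # Hedged sketch of a rough 0/1 frequency check for an n=1 (Bernoulli-like)
    # Binomial on CUDA; the values here are illustrative.
    import torch

    p = 0.25
    n_samples = 1_000_000
    dist = torch.distributions.Binomial(1, torch.full((n_samples,), p, device="cuda"))
    samples = dist.sample()

    frac_ones = (samples == 1).float().mean().item()
    frac_zeros = (samples == 0).float().mean().item()
    assert abs(frac_ones - p) < 0.01          # roughly p ones ...
    assert abs(frac_zeros - (1 - p)) < 0.01   # ... and (1 - p) zeros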

@mattip
Contributor Author

mattip commented Aug 13, 2020

I added a smoke test, but now there are new CI failures.

@ngimel
Collaborator

ngimel commented Aug 13, 2020

Test failures are unrelated. Thank you!

Contributor

@facebook-github-bot facebook-github-bot left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 059aa34.


Labels

Merged · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] Binomial distribution has small chance of returning -1

6 participants