
Conversation

@mattip
Contributor

@mattip mattip commented Aug 6, 2020

Fixes #42153

As documented (search for curand_uniform on the page), curand_uniform returns values "from 0.0 to 1.0, where 1.0 is included and 0.0 is excluded." These endpoints are different from the CPU equivalent, which makes the calculation in this PR fail when the value is exactly 1.0.

The test from the issue is added; it failed for me consistently before this PR, even though I cut the number of samples by a factor of 10.
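
For reference, here is a rough sketch of the kind of check the issue describes; the counts, probabilities, and iteration numbers below are illustrative rather than the values used in the added test, and a CUDA device is required since only the CUDA sampler is affected:

    # Hedged sketch of the reported failure mode, not the exact test added in
    # this PR: sample a CUDA Binomial repeatedly and verify every sample stays
    # inside the valid range [0, total_count].
    import torch

    total_count = 10
    probs = torch.full((100_000,), 0.5, device="cuda")
    dist = torch.distributions.Binomial(total_count, probs)

    for _ in range(100):
        samples = dist.sample()
        # Before the fix, a curand_uniform() result of exactly 1.0 could push
        # a sample out of range (e.g. to -1).
        assert samples.min() >= 0
        assert samples.max() <= total_count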

@dr-ci

dr-ci bot commented Aug 6, 2020

💊 CI failures summary and remediations

As of commit e30b771 (more details on the Dr. CI page):


  • 5/5 failures possibly* introduced in this PR
    • 1/5 non-CircleCI failure(s)

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build caffe2_onnx_main_py3_6_clang7_ubuntu16_04_test (1/4)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Aug 13 18:27:21 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311
Aug 13 18:27:18 The directory '/var/lib/jenkins/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. 
Aug 13 18:27:18 Collecting pip 
Aug 13 18:27:19   Downloading https://files.pythonhosted.org/packages/5a/4a/39400ff9b36e719bdf8f31c99fe1fa7842a42fa77432e584f707a5080063/pip-20.2.2-py2.py3-none-any.whl (1.5MB) 
Aug 13 18:27:19 Installing collected packages: pip 
Aug 13 18:27:19   Found existing installation: pip 9.0.1 
Aug 13 18:27:19     Uninstalling pip-9.0.1: 
Aug 13 18:27:19       Successfully uninstalled pip-9.0.1 
Aug 13 18:27:20 Successfully installed pip-20.2.2 
Aug 13 18:27:20 + pip install -q --user -i https://test.pypi.org/simple/ ort-nightly==1.4.0.dev202007311 
Aug 13 18:27:21 ERROR: Could not find a version that satisfies the requirement ort-nightly==1.4.0.dev202007311 (from versions: 1.4.0.dev202007271, 1.4.0.dev202008122) 
Aug 13 18:27:21 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311 

See CircleCI build pytorch_macos_10_13_py3_test (2/4)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Aug 13 12:11:52 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Aug 13 12:11:52 At: 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Aug 13 12:11:52  
Aug 13 12:11:52 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Aug 13 12:11:52  
Aug 13 12:11:52 At: 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Aug 13 12:11:52  
Aug 13 12:11:52 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Aug 13 12:11:52  
Aug 13 12:11:52 At: 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Aug 13 12:11:52   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Aug 13 12:11:52  
Aug 13 12:11:52 ok (1.250s) 
Aug 13 12:11:53   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.272s) 
Aug 13 12:11:55   test_return_local_rrefs (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.253s) 
Aug 13 12:11:56   test_rpc_return_rref (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.368s) 
Aug 13 12:12:04   test_rpc_timeouts (__main__.ProcessGroupRpcTestWithSpawn) ... ok (7.980s) 

See CircleCI build caffe2_onnx_ort2_py3_6_clang7_ubuntu16_04_test (3/4)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Aug 13 18:27:25 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311
Aug 13 18:27:22 The directory '/var/lib/jenkins/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. 
Aug 13 18:27:22 Collecting pip 
Aug 13 18:27:22   Downloading https://files.pythonhosted.org/packages/5a/4a/39400ff9b36e719bdf8f31c99fe1fa7842a42fa77432e584f707a5080063/pip-20.2.2-py2.py3-none-any.whl (1.5MB) 
Aug 13 18:27:23 Installing collected packages: pip 
Aug 13 18:27:23   Found existing installation: pip 9.0.1 
Aug 13 18:27:23     Uninstalling pip-9.0.1: 
Aug 13 18:27:23       Successfully uninstalled pip-9.0.1 
Aug 13 18:27:24 Successfully installed pip-20.2.2 
Aug 13 18:27:24 + pip install -q --user -i https://test.pypi.org/simple/ ort-nightly==1.4.0.dev202007311 
Aug 13 18:27:25 ERROR: Could not find a version that satisfies the requirement ort-nightly==1.4.0.dev202007311 (from versions: 1.4.0.dev202007271, 1.4.0.dev202008122) 
Aug 13 18:27:25 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311 

See CircleCI build caffe2_onnx_ort1_py3_6_clang7_ubuntu16_04_test (4/4)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Aug 13 18:27:19 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311
Aug 13 18:27:16 The directory '/var/lib/jenkins/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. 
Aug 13 18:27:16 Collecting pip 
Aug 13 18:27:17   Downloading https://files.pythonhosted.org/packages/5a/4a/39400ff9b36e719bdf8f31c99fe1fa7842a42fa77432e584f707a5080063/pip-20.2.2-py2.py3-none-any.whl (1.5MB) 
Aug 13 18:27:17 Installing collected packages: pip 
Aug 13 18:27:17   Found existing installation: pip 9.0.1 
Aug 13 18:27:17     Uninstalling pip-9.0.1: 
Aug 13 18:27:17       Successfully uninstalled pip-9.0.1 
Aug 13 18:27:18 Successfully installed pip-20.2.2 
Aug 13 18:27:18 + pip install -q --user -i https://test.pypi.org/simple/ ort-nightly==1.4.0.dev202007311 
Aug 13 18:27:19 ERROR: Could not find a version that satisfies the requirement ort-nightly==1.4.0.dev202007311 (from versions: 1.4.0.dev202007271, 1.4.0.dev202008122) 
Aug 13 18:27:19 ERROR: No matching distribution found for ort-nightly==1.4.0.dev202007311 

ci.pytorch.org: 1 failed



@mattip mattip changed the title from "Clip Binomial results for numerical instability" to "Clip Binomial results for different endpoints in curand_uniform" on Aug 6, 2020
@ailzhang ailzhang requested a review from ngimel August 7, 2020 01:42
@ailzhang ailzhang added the "triaged" label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Aug 7, 2020
@ailzhang
Contributor

ailzhang commented Aug 7, 2020

cc @ngimel for a review since you are in the original issue thread; please feel free to remove the assignment if it doesn't fit.

@mattip
Contributor Author

mattip commented Aug 7, 2020

Lint error is due to CMake not finding Protobuf:

CMake Error at /usr/local/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:164 (message):
  Could NOT find Protobuf: Found unsuitable version
  "Protobuf_VERSION_NOTFOUND", but required is at least "3" (found
  Protobuf_LIBRARY-NOTFOUND;-pthread)

The caffe2_onnx_main_py3_6_clang7_ubuntu16_04_build failure is caused by a CMake version discrepancy:

CMake Error at third_party/tensorpipe/CMakeLists.txt:12 (cmake_minimum_required):
Aug 06 22:05:47   CMake 3.17 or higher is required.  You are running version 3.5.1

I don't think I caused those ...

@mattip
Contributor Author

mattip commented Aug 7, 2020

I don't think I caused those ...

Yes, I did. I picked up some third-party submodule changes by mistake. Fixing.

@mattip
Contributor Author

mattip commented Aug 10, 2020

A few of the builds got this error; should I rebase off master again?

fatal: reference is not a tree: f015d698006c4a11be15b1ebb75b3b9bb317b914
Unable to checkout 'f015d698006c4a11be15b1ebb75b3b9bb317b914' in submodule \
    path 'third_party/tensorpipe'

Collaborator

This won't work, because val is float, and you are static_cast-ing it (not reinterpret_cast-ing it) to uint64_t, which means that it will become either 0 or 1.

Suggested change
-auto val = curand_uniform(&state);
+uint val = curand(&state); // need just the bits
+// MASK should be uint; int64 arithmetic is much slower. Honestly, even
+// static_cast<uint64_t>(1) is not needed, because digits for float is 24, so
+// 32 bits is enough, but it does not matter.
+constexpr auto MASK = static_cast<uint>((static_cast<uint64_t>(1) << std::numeric_limits<float>::digits) - 1);
+// uint64_t is not needed here either.
+constexpr auto DIVISOR = static_cast<float>(1) / (static_cast<uint>(1) << std::numeric_limits<float>::digits);
+return (val & MASK) * DIVISOR;
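
For intuition, the mask-and-scale trick can be sanity-checked outside of CUDA with ordinary random integers; a rough sketch (not part of the suggestion above, and the sample size is arbitrary):

    # Hedged sketch: masking random bits down to float's 24 mantissa digits and
    # scaling by 2**-24 yields values in [0, 1) -- 0.0 is reachable, 1.0 never is.
    import numpy as np

    rng = np.random.default_rng(0)
    bits = rng.integers(0, 2**32, size=1_000_000, dtype=np.uint64)
    MASK = (1 << 24) - 1                  # keep 24 bits, like std::numeric_limits<float>::digits
    DIVISOR = np.float32(1.0) / (1 << 24)
    vals = (bits & MASK).astype(np.float32) * DIVISOR

    assert vals.min() >= 0.0
    assert vals.max() < 1.0               # largest possible value is (2**24 - 1) / 2**24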

Contributor Author

thanks. I assumed tests would catch any gross errors :(

Contributor Author

adopted in 7257949. Also rebased off master to hopefully clear the CI errors

Collaborator

It's actually concerning that the tests don't catch those errors.

Collaborator

It should probably be uint32_t, to fix Windows builds, sorry.

@mattip
Contributor Author

mattip commented Aug 13, 2020

The pr/pytorch-linux-xenial-rocm3.5.1-py3.6 build is failing test_sum_fp16 (__main__.TestCuda) and test_trilu_indices (__main__.TestCuda). I can't see a connection to my changes:

  • this is cuda code, my changes are CPU only
  • there should be no connection to binomial sampling in those tests

@ngimel
Collaborator

ngimel commented Aug 13, 2020

Your changes are CUDA :-) but it does not look like they are related. ROCm has been very flaky lately. Let's try to merge.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Collaborator

ngimel commented Aug 13, 2020

Oh, btw, sorry to ask, but in the test you are doing, can you also roughly check the number of 0's and 1's, so that at least glaring errors in the distribution are caught?

@mattip
Contributor Author

mattip commented Aug 13, 2020

in the test you are doing, can you also roughly check the number of 0's and 1's

Yup.
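
Roughly this kind of check (a sketch only; the probability, sample count, and tolerance below are placeholders rather than the exact test that was added):

    # Hedged sketch of a rough 0/1 frequency check for an n=1 (Bernoulli-like)
    # Binomial on CUDA; the values here are illustrative.
    import torch

    p = 0.25
    n_samples = 1_000_000
    dist = torch.distributions.Binomial(1, torch.full((n_samples,), p, device="cuda"))
    samples = dist.sample()

    frac_ones = (samples == 1).float().mean().item()
    frac_zeros = (samples == 0).float().mean().item()
    assert abs(frac_ones - p) < 0.01          # roughly p ones ...
    assert abs(frac_zeros - (1 - p)) < 0.01   # ... and (1 - p) zeros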

@mattip
Contributor Author

mattip commented Aug 13, 2020

I added a smoke test, but now there are new CI failures.

@ngimel
Collaborator

ngimel commented Aug 13, 2020

Test failures are unrelated. Thank you!

Contributor

@facebook-github-bot facebook-github-bot left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 059aa34.


Labels

Merged · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] Binomial distribution has small chance of returning -1

6 participants