
Conversation

@syed-ahmed
Collaborator

@syed-ahmed syed-ahmed commented Jun 4, 2019

Stack from ghstack:

Resubmit of #16604
This PR originated from #13070.

Summary:

The purpose of this PR is to refactor the Random Number Generator (RNG) design in ATen. Currently, RNGs in PyTorch have an asymmetric design: CPU generators use an ATen class, whereas CUDA generators use legacy THC code (THCRNGState, THCState, THCRandom_Init, etc.). Moreover, the concept of a generator in ATen isn't clear from the current design. This PR is the first part of a larger refactoring effort around RNGs and only handles the PyTorch front end and the CPU backend for now. It does the following:

  • Clarifies generator concepts by revising the Generator, CPUGenerator and CUDAGenerator classes.
  • Moves mt19937 from TH to ATen as MT19937RNGEngine.h and moves the distributions from THRandom.cpp to DistributionsHelper.h. Adds a Philox engine (PhiloxRNGEngine.h) along with unit tests for it (see the sketch after this list).
  • Fixes hardcoded generator-related code in several Python files used for code generation, such as function_wrapper.py.
  • Fixes the generator front-end Python bindings to include the device kwarg and a default kwarg.
  • Removes creation of generators from Types.
  • Updates documentation and comments, and adds documentation for the torch.Generator API.
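A minimal sketch of how the relocated engine might be used directly, assuming the header lives at ATen/core/MT19937RNGEngine.h and the engine class is named at::mt19937 (names inferred from this PR's description rather than verified here):

#include <cstdint>
#include <iostream>
#include <ATen/core/MT19937RNGEngine.h>  // assumed location of the moved engine

int main() {
  at::mt19937 engine(42);   // construct the engine with an arbitrary seed
  uint32_t r = engine();    // each call produces a 32-bit random number
  std::cout << r << std::endl;
  return 0;
}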

Differential Revision: D15696497

@pytorchbot pytorchbot added module: cpu CPU specific problem (e.g., perf, algorithm) module: cuda Related to torch.cuda, and CUDA support in general module: internals Related to internal abstractions in c10 and ATen module: operators labels Jun 4, 2019
@syed-ahmed
Collaborator Author

syed-ahmed commented Jun 4, 2019

TODO:

  • Release notes
@syed-ahmed I think it would be a good idea to write "how to migrate your code" release notes (and if we can't write these migration notes in a reasonable way, we should add back some methods for BC). Things to cover:

the torch::Generator type is different now, right? If you previously wrote torch::Generator& x = ... in your code, what should I write now?
If any methods from at::Context were removed, document their replacements
If any methods on what previously was torch::Generator changed, document their replacements
One thing that makes this tricky is that this PR only does CPU, and not CUDA. So I'm not sure how many BC problems being only partially refactored introduces.

New changes:

  • getNonDeterministicRandom is now analogous to uint64_t THRandom_seed(THGenerator *_generator). @ezyang I think this would fix the failing tests in the internal CI.

@syed-ahmed
Collaborator Author

Interestingly, the current code fails with time(0) seed initialization on Windows (because it was never tested there :) ). I'll try replacing time(0) with high_resolution_clock and see if that helps on Windows.

 ======================================================================
20:59:58 FAIL: test_generator_cpu (__main__.TestTorch)
20:59:58 ----------------------------------------------------------------------
20:59:58 Traceback (most recent call last):
20:59:58   File "test_torch.py", line 9306, in test_generator_cpu
20:59:58     self.assertNotEqual(g1.initial_seed(), g2.initial_seed())
20:59:58   File "C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-test2\test\common_utils.py", line 569, in assertNotEqual
20:59:58     super(TestCase, self).assertNotEqual(x, y, message)
20:59:58 AssertionError: 1559681949 == 1559681949 : 
20:59:58 
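A rough sketch of the idea mentioned above: derive the seed from std::chrono::high_resolution_clock instead of time(0), so that two generators created back to back (as in test_generator_cpu) are unlikely to observe the same value. This is only an illustration of the approach, not the exact patch.

#include <chrono>
#include <cstdint>

// Returns a nanosecond-resolution tick count to use as a seed.
uint64_t clock_based_seed() {
  auto now = std::chrono::high_resolution_clock::now();
  return static_cast<uint64_t>(
      std::chrono::duration_cast<std::chrono::nanoseconds>(
          now.time_since_epoch()).count());
}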

@ezyang ezyang self-requested a review June 6, 2019 16:43
@ezyang
Contributor

ezyang commented Jun 6, 2019

Don't forget to tag me as reviewer ;)

@ezyang ezyang added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jun 6, 2019
@syed-ahmed
Collaborator Author

@ezyang Here's a small update. I was finishing up the CUDA part to address your concern that "this PR only does CPU, and not CUDA, so I'm not sure how many BC problems being only partially refactored introduces". @mruberry and @ngimel reviewed the CPU part of this PR and the changes are in this commit - 9f9de44. Things that have changed since you last reviewed are:

  • We are returning shared_ptrs now from getDefaultCPUGenerator() and createCPUGenerator().
  • Some stylistic changes: use of curly braces, removal of the complex CRTP pattern (CloneableGenerator), comments, and renaming check_generator_with_default to get_generator_or_default.
  • getNonDeterministicRandom is now analogous to uint64_t THRandom_seed(THGenerator *_generator) and THCRandom_seed. In addition, it contains the way CUDA currently obtains a non-deterministic random number, so that we keep all the crime in one place and it's easier to modify in the future. (A small sketch follows this list.)
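A small sketch of how the pieces above might fit together from the caller's side: create a new CPU generator and seed it non-deterministically. The names used here (at::detail::createCPUGenerator, at::detail::getNonDeterministicRandom, set_current_seed) are taken from this discussion or assumed, so treat this as illustrative rather than the exact final API.

#include <cstdint>
#include <memory>
#include <ATen/CPUGenerator.h>

std::shared_ptr<at::CPUGenerator> make_randomly_seeded_generator() {
  // createCPUGenerator (like getDefaultCPUGenerator) now returns a shared_ptr.
  auto gen = at::detail::createCPUGenerator();
  // Analogous to the old THRandom_seed / THCRandom_seed behavior.
  gen->set_current_seed(at::detail::getNonDeterministicRandom());
  return gen;
}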

@mruberry @ngimel Please review the changes in commit 9f9de44. This is a follow-up on our internal review. Please feel free to leave more comments in this PR.

@ezyang I'm still finishing up the release notes. In the meantime, if you could import this PR and tell me whether it's passing the internal CI, that would be great.

@ezyang
Contributor

ezyang commented Jun 10, 2019

Some internal breakages:

stderr: hphp/facebook/extensions/tensor/shared/TensorImpl.cpp: In function ‘void facebook::detail::setSeed(int64_t)’:
hphp/facebook/extensions/tensor/shared/TensorImpl.cpp:697:28: error: ‘class at::Context’ has no member named ‘defaultGenerator’
     torch::globalContext().defaultGenerator(torch::kCPU);
                            ^~~~~~~~~~~~~~~~
hphp/facebook/extensions/tensor/shared/TensorImpl.cpp:698:11: error: ‘struct at::Generator’ has no member named ‘manualSeed’; did you mean ‘manualSeedAll’?
   cpu_gen.manualSeed(seed);
           ^~~~~~~~~~
           manualSeedAll

@ezyang
Contributor

ezyang commented Jun 10, 2019

I'm actually not sure what I'm supposed to replace torch::globalContext().defaultGenerator with; defaultCPUGenerator is in a detail namespace
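A hedged sketch of the migration in question, based on the old code in the error log above and the new API names mentioned in this thread; the "new" lines are assumptions about the replacement rather than a confirmed answer.

#include <cstdint>
#include <ATen/CPUGenerator.h>

void setSeed(int64_t seed) {
  // Old (pre-refactor) code, as seen in the internal breakage above:
  //   auto& cpu_gen = torch::globalContext().defaultGenerator(torch::kCPU);
  //   cpu_gen.manualSeed(seed);

  // Possible replacement: fetch the default CPU generator and seed it directly.
  auto cpu_gen = at::detail::getDefaultCPUGenerator();
  cpu_gen->set_current_seed(static_cast<uint64_t>(seed));
}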

@ezyang
Contributor

ezyang commented Jun 11, 2019

I got some internal failures for you:

pytorch/gpytorch:test_gpytorch - test_inv_quad_logdet_vector (test.functions.test_inv_quad_log_det.TestInvQuadLogDetNonBatch):

AssertionError: tensor1 (torch.Size([4, 4])) and tensor2 (torch.Size([4, 4])) are not close enough. 
max rtol: 0.39336467	max atol: 0.30066603

This should be https://github.com/cornellius-gp/gpytorch though I haven't tried reproducing the problem with the most recent OSS version of gpytorch. Our internal copy of gpytorch appears to be 781883b67f31f2e8f767f3a171ac190b5a168fef

@syed-ahmed
Collaborator Author

I got some internal failures for you:

pytorch/gpytorch:test_gpytorch - test_inv_quad_logdet_vector (test.functions.test_inv_quad_log_det.TestInvQuadLogDetNonBatch):

AssertionError: tensor1 (torch.Size([4, 4])) and tensor2 (torch.Size([4, 4])) are not close enough. 
max rtol: 0.39336467	max atol: 0.30066603

This should be https://github.com/cornellius-gp/gpytorch though I haven't tried reproducing the problem with the most recent OSS version of gpytorch. Our internal copy of gpytorch appears to be 781883b67f31f2e8f767f3a171ac190b5a168fef

@ezyang The source of this error was here. Following is what we currently have in PyTorch:

void bernoulli_mkl_kernel(Tensor &self, const double p, Generator* gen) {
  THGenerator* generator = get_generator(gen);
  int64_t seed;
  {
    std::lock_guard<std::mutex> lock(generator->mutex);
    seed = THRandom_random(generator);
  }

The bernoulli MKL kernel is currently being seeded with THRandom_random. THRandom_random returns a uint64_t, but under the hood it's a 32-bit random (@.@). Since the bernoulli MKL kernel (correctly) wants to seed with a 64-bit random, I changed the THRandom_random call to my gen->random64() in this patch. My random64 function uses two 32-bit randoms to construct a 64-bit random. Hence, as it turns out, this difference from the old behavior perturbed the test in gpytorch. I changed the gen->random64() call back to the 32-bit version here: 3565af2, and checked that the test in question passes.
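A minimal illustration of the "two 32-bit randoms make one 64-bit random" idea described above, using std::mt19937 as a stand-in engine; the actual random64() in this patch may combine the halves differently.

#include <cstdint>
#include <random>

// Compose one 64-bit value from two consecutive 32-bit draws.
uint64_t random64(std::mt19937& engine) {
  uint64_t hi = engine();   // upper 32 bits
  uint64_t lo = engine();   // lower 32 bits
  return (hi << 32) | lo;
}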

As a side note, the test in gpytorch passes with UNLOCK_SEED=true, which tells me that seeding the engine with a 64-bit number would still have worked. This article says: "The Mersenne Twister is prone to “bad states” (all zeros causes it to not work at all, whereas lots of zero bits are merely bad)—taking 0, or 1, or 2 and repeating it 624 times could be disastrous." However, throughout this PR, I have come to believe that the test suites of PyTorch and other derivative projects contradict this statement, as more tests fail with a true 64-bit seed than with a 32-bit seed ¯\_(ツ)_/¯.

@ezyang
Contributor

ezyang commented Jun 12, 2019

I changed the call gen->random64() to the 32-bit version here: 3565af2. and checked the test in question passes.

OK. This sounds like a good change from a bug-for-bug compatibility standpoint, but I kind of agree that your change (it wanted a 64 bit seed, give it a 64 bit seed) seems more correct. Maybe we should file a bug in gpytorch on this?

@ezyang
Contributor

ezyang commented Jun 12, 2019

Landing it! Wish me luck!!

@ezyang
Contributor

ezyang commented Jun 12, 2019

Don't forget to finish the release notes!

@zou3519 zou3519 deleted the gh/syed-ahmed/14/head branch June 12, 2019 20:04
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 12, 2019
Summary:
Pull Request resolved: pytorch/pytorch#21364
ghimport-source-id: ca7d37e10190ba46dc8512f437404ca9216d3369

Differential Revision: D15696497

Pulled By: ezyang

fbshipit-source-id: 2e713b8566ae915e175b5a79ac1dd9b86cc2a23d
aten/src/ATen/core/Array.h

C10_HIP_HOST_DEVICE Array() = default;
C10_HIP_HOST_DEVICE Array(const Array&) = default;
C10_HIP_HOST_DEVICE Array& operator=(const Array&) = default;
#ifdef __HIP_PLATFORM_HCC__
Member

Was this ifdef necessary? C10_HIP_HOST_DEVICE is empty if __HIP_PLATFORM_HCC__ isn't defined.

Collaborator Author

Yeah, this was necessary when I moved it to ATen/core as-is. Without this change, I was getting the following error:

23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/CPUGenerator.cpp.o
23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/CPUTypeDefault.cpp.o
23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/Context.cpp.o
23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/DLConvertor.cpp.o
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.cpp:1:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.h:5:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/core/PhiloxRNGEngine.h:15:
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:21:3: error: unknown type name '__host__'
23:21:53   C10_HIP_HOST_DEVICE Array() = default;
23:21:53   ^
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:29: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__
23:21:53                             ^
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.cpp:1:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.h:5:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/core/PhiloxRNGEngine.h:15:
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:21:22: error: expected ';' at end of declaration list
23:21:53   C10_HIP_HOST_DEVICE Array() = default;
23:21:53                      ^
23:21:53                      ;
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:22:3: error: unknown type name '__host__'
23:21:53   C10_HIP_HOST_DEVICE Array(const Array&) = default;
23:21:53   ^
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:29: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__
23:21:53                             ^
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.cpp:1:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.h:5:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/core/PhiloxRNGEngine.h:15:
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:22:3: error: duplicate member '__device__'
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:38: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__
23:21:53                                      ^
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:21:3: note: previous declaration is here
23:21:53   C10_HIP_HOST_DEVICE Array() = default;
23:21:53   ^
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:38: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__

Member

This is in the ROCm build? This doesn't make any sense. The ifdef around here exactly matches the ifdef in Macros.h. I'm concerned that something is messed up in the includes or macros.

pytorch/c10/macros/Macros.h

Lines 188 to 192 in 556af7c

#ifdef __HIP_PLATFORM_HCC__
#define C10_HIP_HOST_DEVICE __host__ __device__
#else
#define C10_HIP_HOST_DEVICE
#endif

Collaborator Author

Maybe the ifdef in Macros.h should be on __HIPCC__ instead.
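What the suggestion above might look like, sketched out: guard on the HIP compiler macro (__HIPCC__) rather than __HIP_PLATFORM_HCC__. This is an illustration of the idea, not the change that was actually merged.

#if defined(__HIPCC__)
#define C10_HIP_HOST_DEVICE __host__ __device__
#else
#define C10_HIP_HOST_DEVICE
#endif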

@facebook-github-bot
Contributor

@ezyang merged this pull request in ae342fd.

namespace {
template <typename scalar_t>
void randperm_cpu(Tensor& result, int64_t n, THGenerator* generator) {
  std::lock_guard<std::mutex> lock(generator->mutex);
Contributor

I just realized we removed the lock_guard here. Is that OK? We are still using the generator, right?

Collaborator Author

Yeah. I moved the lock to randperm_out_cpu. I decided to keep get_generator_or_default and the lock_guard next to each other, in most cases, to emphasize that we don't forget to take the lock.
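A rough sketch (in the at::native namespace) of the pattern described above: acquire the generator with get_generator_or_default and take its lock right away in randperm_out_cpu, before doing the actual fill. Member and helper names here (mutex_, detail::getDefaultCPUGenerator) are taken from this thread or assumed, so treat this as illustrative only.

#include <mutex>
#include <ATen/ATen.h>
#include <ATen/CPUGenerator.h>

namespace at { namespace native {

Tensor& randperm_out_cpu_sketch(Tensor& result, int64_t n, Generator* generator) {
  auto gen = get_generator_or_default<CPUGenerator>(
      generator, detail::getDefaultCPUGenerator());
  // Keeping the lock next to the generator lookup makes it hard to forget.
  std::lock_guard<std::mutex> lock(gen->mutex_);
  // ... fill `result` with a random permutation of [0, n) using `gen` ...
  return result;
}

}} // namespace at::native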
