
Conversation

@syed-ahmed
Collaborator

@syed-ahmed syed-ahmed commented Jun 4, 2019

Stack from ghstack:

Resubmit of #16604
This PR originated from #13070.

Summary:

The purpose of this PR is to refactor the Random Number Generator (RNG) design in ATen. Currently, RNGs in PyTorch have an asymmetric design: CPU generators use an ATen class, whereas CUDA generators use legacy THC code (THCRNGState, THCState, THCRandom_Init, etc.). Moreover, the concept of a generator in ATen isn't clear from the current design. This PR is the first part of a larger refactoring effort around RNGs and only handles the PyTorch front end and the CPU backend for now. It does the following:

  • Clarifies generator concepts by revising the Generator, CPUGenerator and CUDAGenerator classes.
  • Moves mt19937 from TH to ATen as MT19937RNGEngine.h and moves the distributions from THRandom.cpp to DistributionsHelper.h. Adds a Philox engine (PhiloxRNGEngine.h) along with unit tests for it (see the sketch after this list).
  • Fixes hardcoded generator-related code in several Python files used for code generation, such as function_wrapper.py.
  • Fixes the generator front-end Python bindings to include the device kwarg and a default kwarg.
  • Removes creation of generators from Types.
  • Updates documentation and comments, and adds documentation for the torch.Generator API.
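A minimal sketch of how the relocated engine might be used directly, assuming the header lives at ATen/core/MT19937RNGEngine.h and the engine class is named at::mt19937 (names inferred from this PR's description rather than verified here):

#include <cstdint>
#include <iostream>
#include <ATen/core/MT19937RNGEngine.h>  // assumed location of the moved engine

int main() {
  at::mt19937 engine(42);   // construct the engine with an arbitrary seed
  uint32_t r = engine();    // each call produces a 32-bit random number
  std::cout << r << std::endl;
  return 0;
}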

Differential Revision: D15696497

@pytorchbot pytorchbot added module: cpu CPU specific problem (e.g., perf, algorithm) module: cuda Related to torch.cuda, and CUDA support in general module: internals Related to internal abstractions in c10 and ATen module: operators labels Jun 4, 2019
@syed-ahmed
Collaborator Author

syed-ahmed commented Jun 4, 2019

TODO:

  • Release notes
@syed-ahmed I think it would be a good idea to write "how to migrate your code" release notes (and if we can't write these migration notes in a reasonable way, we should add back some methods for BC). Things to cover:

the torch::Generator type is different now, right? If you previously wrote torch::Generator& x = ... in your code, what should I write now?
If any methods from at::Context were removed, document their replacements
If any methods on what previously was torch::Generator changed, document their replacements
One thing that makes this tricky is that this PR only does CPU, and not CUDA. So I'm not sure how many BC problems being only partially refactored introduces.

New changes:

  • getNonDeterministicRandom is now analogous to uint64_t THRandom_seed(THGenerator *_generator). @ezyang I think this would fix the failing tests in the internal CI.

@syed-ahmed
Collaborator Author

Interestingly, the current code fails with time(0) seed initialization on Windows (because it was never tested there :) ). I'll try replacing time(0) with high_resolution_clock and see if that helps on Windows.

 ======================================================================
20:59:58 FAIL: test_generator_cpu (__main__.TestTorch)
20:59:58 ----------------------------------------------------------------------
20:59:58 Traceback (most recent call last):
20:59:58   File "test_torch.py", line 9306, in test_generator_cpu
20:59:58     self.assertNotEqual(g1.initial_seed(), g2.initial_seed())
20:59:58   File "C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-test2\test\common_utils.py", line 569, in assertNotEqual
20:59:58     super(TestCase, self).assertNotEqual(x, y, message)
20:59:58 AssertionError: 1559681949 == 1559681949 : 
20:59:58 
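A rough sketch of the idea mentioned above: derive the seed from std::chrono::high_resolution_clock instead of time(0), so that two generators created back to back (as in test_generator_cpu) are unlikely to observe the same value. This is only an illustration of the approach, not the exact patch.

#include <chrono>
#include <cstdint>

// Returns a nanosecond-resolution tick count to use as a seed.
uint64_t clock_based_seed() {
  auto now = std::chrono::high_resolution_clock::now();
  return static_cast<uint64_t>(
      std::chrono::duration_cast<std::chrono::nanoseconds>(
          now.time_since_epoch()).count());
}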

@ezyang ezyang self-requested a review June 6, 2019 16:43
@ezyang
Contributor

ezyang commented Jun 6, 2019

Don't forget to tag me as reviewer ;)

@ezyang ezyang added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jun 6, 2019
@syed-ahmed
Collaborator Author

@ezyang Here's a small update. I was finishing up the CUDA part to address your concern that "this PR only does CPU, and not CUDA, so I'm not sure how many BC problems being only partially refactored introduces". @mruberry and @ngimel reviewed the CPU part of this PR and the changes are in this commit - 9f9de44. Things that have changed since you last reviewed are:

  • We are returning shared_ptrs now from getDefaultCPUGenerator() and createCPUGenerator().
  • Some stylistic changes: use of curly braces, removal of the complex CRTP pattern (CloneableGenerator), comments, and renaming check_generator_with_default to get_generator_or_default.
  • getNonDeterministicRandom is now analogous to uint64_t THRandom_seed(THGenerator *_generator) and THCRandom_seed. In addition, it contains the way CUDA currently obtains a non-deterministic random number, so that we keep all the crime in one place and it's easier to modify in the future. (A small sketch follows this list.)
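A small sketch of how the pieces above might fit together from the caller's side: create a new CPU generator and seed it non-deterministically. The names used here (at::detail::createCPUGenerator, at::detail::getNonDeterministicRandom, set_current_seed) are taken from this discussion or assumed, so treat this as illustrative rather than the exact final API.

#include <cstdint>
#include <memory>
#include <ATen/CPUGenerator.h>

std::shared_ptr<at::CPUGenerator> make_randomly_seeded_generator() {
  // createCPUGenerator (like getDefaultCPUGenerator) now returns a shared_ptr.
  auto gen = at::detail::createCPUGenerator();
  // Analogous to the old THRandom_seed / THCRandom_seed behavior.
  gen->set_current_seed(at::detail::getNonDeterministicRandom());
  return gen;
}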

@mruberry @ngimel Please review the changes in commit 9f9de44. This is a follow-up on our internal review. Please feel free to leave more comments in this PR.

@ezyang I'm still finishing up the release notes. In the meantime, if you could import this PR and tell me whether it's passing the internal CI, that would be great.

@ezyang
Contributor

ezyang commented Jun 10, 2019

Some internal breakages:

stderr: hphp/facebook/extensions/tensor/shared/TensorImpl.cpp: In function ‘void facebook::detail::setSeed(int64_t)’:
hphp/facebook/extensions/tensor/shared/TensorImpl.cpp:697:28: error: ‘class at::Context’ has no member named ‘defaultGenerator’
     torch::globalContext().defaultGenerator(torch::kCPU);
                            ^~~~~~~~~~~~~~~~
hphp/facebook/extensions/tensor/shared/TensorImpl.cpp:698:11: error: ‘struct at::Generator’ has no member named ‘manualSeed’; did you mean ‘manualSeedAll’?
   cpu_gen.manualSeed(seed);
           ^~~~~~~~~~
           manualSeedAll

@ezyang
Contributor

ezyang commented Jun 10, 2019

I'm actually not sure what I'm supposed to replace torch::globalContext().defaultGenerator with; defaultCPUGenerator is in a detail namespace
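A hedged sketch of the migration in question, based on the old code in the error log above and the new API names mentioned in this thread; the "new" lines are assumptions about the replacement rather than a confirmed answer.

#include <cstdint>
#include <ATen/CPUGenerator.h>

void setSeed(int64_t seed) {
  // Old (pre-refactor) code, as seen in the internal breakage above:
  //   auto& cpu_gen = torch::globalContext().defaultGenerator(torch::kCPU);
  //   cpu_gen.manualSeed(seed);

  // Possible replacement: fetch the default CPU generator and seed it directly.
  auto cpu_gen = at::detail::getDefaultCPUGenerator();
  cpu_gen->set_current_seed(static_cast<uint64_t>(seed));
}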

@ezyang
Contributor

ezyang commented Jun 11, 2019

I got some internal failures for you:

pytorch/gpytorch:test_gpytorch - test_inv_quad_logdet_vector (test.functions.test_inv_quad_log_det.TestInvQuadLogDetNonBatch):

AssertionError: tensor1 (torch.Size([4, 4])) and tensor2 (torch.Size([4, 4])) are not close enough. 
max rtol: 0.39336467	max atol: 0.30066603

This should be https://github.com/cornellius-gp/gpytorch though I haven't tried reproducing the problem with the most recent OSS version of gpytorch. Our internal copy of gpytorch appears to be 781883b67f31f2e8f767f3a171ac190b5a168fef

@syed-ahmed
Collaborator Author

I got some internal failures for you:

pytorch/gpytorch:test_gpytorch - test_inv_quad_logdet_vector (test.functions.test_inv_quad_log_det.TestInvQuadLogDetNonBatch):

AssertionError: tensor1 (torch.Size([4, 4])) and tensor2 (torch.Size([4, 4])) are not close enough. 
max rtol: 0.39336467	max atol: 0.30066603

This should be https://github.com/cornellius-gp/gpytorch though I haven't tried reproducing the problem with the most recent OSS version of gpytorch. Our internal copy of gpytorch appears to be 781883b67f31f2e8f767f3a171ac190b5a168fef

@ezyang The source of this error was here. Following is what we currently have in PyTorch:

void bernoulli_mkl_kernel(Tensor &self, const double p, Generator* gen) {
  THGenerator* generator = get_generator(gen);
  int64_t seed;
  {
    std::lock_guard<std::mutex> lock(generator->mutex);
    seed = THRandom_random(generator);
  }

The bernoulli MKL kernel is currently being seeded with THRandom_random. THRandom_random returns a uint64_t, but under the hood it's a 32-bit random (@.@). Since the bernoulli MKL kernel (correctly) wants to seed with a 64-bit random, I changed the THRandom_random call to my gen->random64() in this patch. My random64 function uses two 32-bit randoms to construct a 64-bit random. Hence, as it turns out, this difference from the old behavior perturbed the test in gpytorch. I changed the gen->random64() call back to the 32-bit version here: 3565af2, and checked that the test in question passes.
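A minimal illustration of the "two 32-bit randoms make one 64-bit random" idea described above, using std::mt19937 as a stand-in engine; the actual random64() in this patch may combine the halves differently.

#include <cstdint>
#include <random>

// Compose one 64-bit value from two consecutive 32-bit draws.
uint64_t random64(std::mt19937& engine) {
  uint64_t hi = engine();   // upper 32 bits
  uint64_t lo = engine();   // lower 32 bits
  return (hi << 32) | lo;
}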

As a side note, the test in gpytorch passes with UNLOCK_SEED=true, which tells me that seeding the engine with a 64-bit number would still have worked. This article says: "The Mersenne Twister is prone to “bad states” (all zeros causes it to not work at all, whereas lots of zero bits are merely bad)—taking 0, or 1, or 2 and repeating it 624 times could be disastrous." However, throughout this PR, I have come to believe that the test suites of PyTorch and other derivative projects contradict this statement, as more tests fail with a true 64-bit seed than with a 32-bit seed ¯\_(ツ)_/¯.

@ezyang
Contributor

ezyang commented Jun 12, 2019

I changed the call gen->random64() to the 32-bit version here: 3565af2. and checked the test in question passes.

OK. This sounds like a good change from a bug-for-bug compatibility standpoint, but I kind of agree that your change (it wanted a 64 bit seed, give it a 64 bit seed) seems more correct. Maybe we should file a bug in gpytorch on this?

@ezyang
Contributor

ezyang commented Jun 12, 2019

Landing it! Wish me luck!!

@ezyang
Contributor

ezyang commented Jun 12, 2019

Don't forget to finish the release notes!

@zou3519 zou3519 deleted the gh/syed-ahmed/14/head branch June 12, 2019 20:04
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 12, 2019
Summary:
Pull Request resolved: pytorch/pytorch#21364
ghimport-source-id: ca7d37e10190ba46dc8512f437404ca9216d3369

Differential Revision: D15696497

Pulled By: ezyang

fbshipit-source-id: 2e713b8566ae915e175b5a79ac1dd9b86cc2a23d
aten/src/ATen/core/Array.h

C10_HIP_HOST_DEVICE Array() = default;
C10_HIP_HOST_DEVICE Array(const Array&) = default;
C10_HIP_HOST_DEVICE Array& operator=(const Array&) = default;
#ifdef __HIP_PLATFORM_HCC__
Member

Was this ifdef necessary? C10_HIP_HOST_DEVICE is empty if __HIP_PLATFORM_HCC__ isn't defined.

Collaborator Author

Yeah, this was necessary when I moved it to ATen/core as-is. Without this change, I was getting the following error:

23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/CPUGenerator.cpp.o
23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/CPUTypeDefault.cpp.o
23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/Context.cpp.o
23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/DLConvertor.cpp.o
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.cpp:1:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.h:5:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/core/PhiloxRNGEngine.h:15:
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:21:3: error: unknown type name '__host__'
23:21:53   C10_HIP_HOST_DEVICE Array() = default;
23:21:53   ^
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:29: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__
23:21:53                             ^
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.cpp:1:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.h:5:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/core/PhiloxRNGEngine.h:15:
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:21:22: error: expected ';' at end of declaration list
23:21:53   C10_HIP_HOST_DEVICE Array() = default;
23:21:53                      ^
23:21:53                      ;
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:22:3: error: unknown type name '__host__'
23:21:53   C10_HIP_HOST_DEVICE Array(const Array&) = default;
23:21:53   ^
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:29: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__
23:21:53                             ^
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.cpp:1:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.h:5:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/core/PhiloxRNGEngine.h:15:
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:22:3: error: duplicate member '__device__'
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:38: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__
23:21:53                                      ^
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:21:3: note: previous declaration is here
23:21:53   C10_HIP_HOST_DEVICE Array() = default;
23:21:53   ^
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:38: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__

Member

This is in the ROCm build? This doesn't make any sense. The ifdef around here exactly matches the ifdef in Macros.h. I'm concerned that something is messed up in the includes or macros.

pytorch/c10/macros/Macros.h

Lines 188 to 192 in 556af7c

#ifdef __HIP_PLATFORM_HCC__
#define C10_HIP_HOST_DEVICE __host__ __device__
#else
#define C10_HIP_HOST_DEVICE
#endif

Collaborator Author

Maybe the ifdef in Macros.h should be on __HIPCC__ instead.
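What the suggestion above might look like, sketched out: guard on the HIP compiler macro (__HIPCC__) rather than __HIP_PLATFORM_HCC__. This is an illustration of the idea, not the change that was actually merged.

#if defined(__HIPCC__)
#define C10_HIP_HOST_DEVICE __host__ __device__
#else
#define C10_HIP_HOST_DEVICE
#endif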

@facebook-github-bot
Contributor

@ezyang merged this pull request in ae342fd.

namespace {
template <typename scalar_t>
void randperm_cpu(Tensor& result, int64_t n, THGenerator* generator) {
  std::lock_guard<std::mutex> lock(generator->mutex);
Contributor

I just realized we removed the lock_guard here. Is that OK? We are still using the generator, right?

Collaborator Author

Yeah. I moved the lock to randperm_out_cpu. I decided to keep get_generator_or_default and the lock_guard next to each other, in most cases, to emphasize that we don't forget to take the lock.
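A rough sketch (in the at::native namespace) of the pattern described above: acquire the generator with get_generator_or_default and take its lock right away in randperm_out_cpu, before doing the actual fill. Member and helper names here (mutex_, detail::getDefaultCPUGenerator) are taken from this thread or assumed, so treat this as illustrative only.

#include <mutex>
#include <ATen/ATen.h>
#include <ATen/CPUGenerator.h>

namespace at { namespace native {

Tensor& randperm_out_cpu_sketch(Tensor& result, int64_t n, Generator* generator) {
  auto gen = get_generator_or_default<CPUGenerator>(
      generator, detail::getDefaultCPUGenerator());
  // Keeping the lock next to the generator lookup makes it hard to forget.
  std::lock_guard<std::mutex> lock(gen->mutex_);
  // ... fill `result` with a random permutation of [0, n) using `gen` ...
  return result;
}

}} // namespace at::native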
