[CPU] Refactor Random Number Generators in ATen #21364
Conversation
Interestingly, the current code fails with
Don't forget to tag me as reviewer ;)
@ezyang Here's a small update. I was finishing up the CUDA part to not worry about the
@mruberry @ngimel Please review the changes in this commit: 9f9de44. This is a follow-up on our internal review. Please feel free to leave more comments in this PR. @ezyang I'm still finishing up the release notes. In the meantime, if you could import this PR and tell me whether it's passing the internal CI, that would be great.
Some internal breakages:
I'm actually not sure what I'm supposed to replace
I got some internal failures for you. This should be https://github.com/cornellius-gp/gpytorch, though I haven't tried reproducing the problem with the most recent OSS version of gpytorch. Our internal copy of gpytorch appears to be at 781883b67f31f2e8f767f3a171ac190b5a168fef
@ezyang The source of this error was here. The following is what we currently have in pytorch: pytorch/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp, lines 126 to 132 in c2a08d3.
The bernoulli MKL kernel is currently seeded with THRandom_seed. THRandom_seed returns a uint64_t, but under the hood it only produces a 32-bit random value. Since the bernoulli MKL kernel (correctly) wants a 64-bit seed, I changed the THRandom_seed call to my gen->random64() in this patch. My random64 function uses two 32-bit randoms to construct a 64-bit random. As it turns out, this difference from THRandom_seed perturbed the test in gpytorch. I changed the gen->random64() call back to the 32-bit version here: 3565af2, and checked that the test in question passes.
As a side note, the test in gpytorch passes with
OK. This sounds like a good change from a bug-for-bug compatibility standpoint, but I kind of agree that your change (it wanted a 64-bit seed, so give it a 64-bit seed) seems more correct. Maybe we should file a bug against gpytorch for this?
Landing it! Wish me luck!!
Don't forget to finish the release notes!
Summary: Pull Request resolved: pytorch/pytorch#21364
ghimport-source-id: ca7d37e10190ba46dc8512f437404ca9216d3369
Differential Revision: D15696497
Pulled By: ezyang
fbshipit-source-id: 2e713b8566ae915e175b5a79ac1dd9b86cc2a23d
C10_HIP_HOST_DEVICE Array() = default;
C10_HIP_HOST_DEVICE Array(const Array&) = default;
C10_HIP_HOST_DEVICE Array& operator=(const Array&) = default;
#ifdef __HIP_PLATFORM_HCC__
Was this ifdef necessary? C10_HIP_HOST_DEVICE is empty if __HIP_PLATFORM_HCC__ isn't defined.
Yeah, this was necessary when I moved it to ATen/core as-is. Without this change, I was getting the following error:
23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/CPUGenerator.cpp.o
23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/CPUTypeDefault.cpp.o
23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/Context.cpp.o
23:21:52 [ 43%] Building CXX object caffe2/CMakeFiles/caffe2.dir/__/aten/src/ATen/DLConvertor.cpp.o
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.cpp:1:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.h:5:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/core/PhiloxRNGEngine.h:15:
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:21:3: error: unknown type name '__host__'
23:21:53 C10_HIP_HOST_DEVICE Array() = default;
23:21:53 ^
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:29: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__
23:21:53 ^
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.cpp:1:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.h:5:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/core/PhiloxRNGEngine.h:15:
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:21:22: error: expected ';' at end of declaration list
23:21:53 C10_HIP_HOST_DEVICE Array() = default;
23:21:53 ^
23:21:53 ;
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:22:3: error: unknown type name '__host__'
23:21:53 C10_HIP_HOST_DEVICE Array(const Array&) = default;
23:21:53 ^
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:29: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__
23:21:53 ^
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.cpp:1:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/CPUGenerator.h:5:
23:21:53 In file included from /var/lib/jenkins/workspace/aten/src/ATen/core/PhiloxRNGEngine.h:15:
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:22:3: error: duplicate member '__device__'
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:38: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__
23:21:53 ^
23:21:53 /var/lib/jenkins/workspace/aten/src/ATen/core/Array.h:21:3: note: previous declaration is here
23:21:53 C10_HIP_HOST_DEVICE Array() = default;
23:21:53 ^
23:21:53 /var/lib/jenkins/workspace/c10/macros/Macros.h:189:38: note: expanded from macro 'C10_HIP_HOST_DEVICE'
23:21:53 #define C10_HIP_HOST_DEVICE __host__ __device__
This is in the ROCm build? This doesn't make any sense. The ifdef around here exactly matches the ifdef in Macros.h. I'm concerned that something is messed up in the includes or macros.
Lines 188 to 192 in 556af7c
#ifdef __HIP_PLATFORM_HCC__
#define C10_HIP_HOST_DEVICE __host__ __device__
#else
#define C10_HIP_HOST_DEVICE
#endif
Maybe the ifdef in Macros.h should be on HIPCC instead.
namespace {
template <typename scalar_t>
void randperm_cpu(Tensor& result, int64_t n, THGenerator* generator) {
  std::lock_guard<std::mutex> lock(generator->mutex);
I just realized we removed the lock_guard here. Is that OK? We are still using the generator, right?
Yeah. I moved the lock to randperm_out_cpu. I decided to keep get_generator_or_default and the lock_guard next to each other, in most cases, to emphasize that we don't forget to take the lock.
Stack from ghstack:
Resubmit of #16604
This PR originated from #13070.
Summary:
The purpose of this PR is to refactor the Random Number Generator (RNG) design in ATen. Currently, RNGs in PyTorch have an asymmetrical design, i.e. CPU generators use an ATen class, whereas CUDA generators use legacy THC code (THCRNGState, THCState, THCRandom_Init, etc.). Moreover, the concept of generators in ATen isn't clear from the current design. This PR is the first part of a larger refactoring effort surrounding RNGs and only handles the PyTorch front-end and CPU backend for now. It does the following: … function_wrapper.py etc. … the torch.Generator API.
Differential Revision: D15696497