[CPU] Refactor Random Number Generators in ATen - Part 1/n #16604
Conversation
This isn't WIP anymore, right? Better remove the …
aten/src/ATen/core/Generator.h
Nice docs! Thank you!
This class is CPU only in this PR, is that right?
the torch.Generator() api is CPU-only in this PR (it's CPU-only in master as well):
https://github.com/pytorch/pytorch/blob/2b826c6171608bebd96be6dd1f8201ab6a0a8dff/torch/csrc/Generator.cpp#L47-L69
In the second PR, I want to be able to create a CUDAGenerator when I pass device="cuda" to torch.Generator().
aten/src/ATen/core/Generator.h
OOC, where did we get this number from?
So since the Philox engine has two 32-bit keys (that is, its state is 64 bits, I think), it is recommended that an RNG engine be seeded with the same number of bits as its state (I think I read this in one of the links I posted in the last PR). Also, the seed value should have a good distribution of 0s and 1s in its bit representation. So I chose a large prime number that fits into uint64_t: https://primes.utm.edu/curios/page.php/67280421310721.html. You'll notice another number in:
https://github.com/pytorch/pytorch/blob/2b826c6171608bebd96be6dd1f8201ab6a0a8dff/test/common_utils.py#L46-L48
The Python side won't let me seed with 64 bits, so I just chose a large prime there under 32 bits.
¯\_(ツ)_/¯
aten/src/ATen/core/Generator.h
This copy constructor seems pretty dangerous, because you're planning to subclass this class, and add fields to the subclasses. When you allow copy construction of classes with subclasses, you are exposing yourself to the possibility of object slicing (https://stackoverflow.com/questions/274626/what-is-object-slicing). It is probably best if we don't allow copy constructor at all, and instead have a virtual method for doing clones.
So, we should disable all copy/move constructors/assignment on the superclass Generator.
CC @smessmer
It's not an object slicing issue as long as you take other by reference, but disabling copy constructors where they're not needed is always a good idea.
aten/src/ATen/core/Generator.h
nit: mutex_
aten/src/ATen/core/Generator.cpp
This looks unnecessary. I'd be very happy to say that device_ is immutable and therefore does not need to be locked.
aten/src/ATen/core/Generator.h
I think we need to think more clearly about what thread-safety guarantees we want this class to give. This is important because you've made device_ and current_seed_ protected variables, which means that people who write subclasses can access these directly (without going through the locked getter); if you don't give clear guidance about thread safety expectations, they will probably get it wrong.
My understanding (which could be wrong) is that the most important thing to make thread safe is the random number generator state itself; changes to device or current seed don't actually happen concurrently from different threads.
That is an important point about protected variables that I completely overlooked! You are right that the most important thing to lock is the RNG. I see a pattern where people like to call torch.manual_seed() before they use their rand functions, so I wondered whether that would raise thread-safety issues if I didn't lock the getters and setters as well.
I feel like manual seed is usually called at the beginning of the program, or from a single-threaded context; that's why I hypothesize it's OK not to lock. Although it's going to set up the RNG state, which we should lock just for good measure, so maybe not.
TBH, the fact that we are sharing RNG state across threads is kind of crazy, it really should be a thread local concept. But I agree this is consistent with what we did previously, so...
aten/src/ATen/core/Generator.cpp
Do we ever instantiate a Generator (and not a subclass)? If not, we should just declare this pure virtual and remove the stub immediately. If that's not possible, I'd prefer if this raised an error if it was called at runtime, rather than just ignore the seed parameter.
I would also support making Generator pure virtual, but got confused when all the CPU and CUDA methods (manualSeedAll) were put in there (maybe people didn't want to do a dynamic_cast and call the subclass's specialized method directly from a Generator*/Generator&?). Either way, I didn't see an instance where we create a Generator object; we just pass Generator* in function signatures, which is then dynamic_cast'ed to the specific backend the code needs.
Fortunately for you, this is in return position, and C++ supports covariant return types. This means you can override the method with a refined return type (CPUGenerator) and it is still a valid override.
aten/src/ATen/CPUGenerator.h
nit: Just drop the detail block below the class, then you don't need a forward declaration anymore
aten/src/ATen/CPUGenerator.h
This looks wrong, you shouldn't declare variables in headers. Move these to the cpp file.
aten/src/ATen/CPUGenerator.h
Are these really internal? You call them from a lot of places... what's the normal way to get one of these generators?
So ideally I wanted to do the following from Context.h:
auto getDefaultGenerator(at::CPU) -> gives me CPUGenerator
auto getDefaultGenerator(at::CUDA) -> gives me CUDAGenerator
auto getDefaultGenerator(at::device({at::kCUDA, 1})) -> gives me CUDAGenerator at device index 1
... same for createGenerator
I thought I could hack this behavior using SFINAE (similar to what I did in the PhiloxRNGEngine's operator() call) but ended up giving up on it after not being able to fix the template errors. So now I am creating this API inside these subclasses, such that when somebody needs a CPUGenerator, they can just call this instead of getDefaultGenerator, which returns a Generator&. All this to avoid having to dynamic_cast all the time in the backend when needing specialized class methods (such as the random() and random64() methods from CPUGenerator).
This sounds fine, maybe they shouldn't be internal ;)
aten/src/ATen/CPUGenerator.cpp
We should resolve the question of thread-safety first (though I agree that copies should be mutex protected), but I do wonder about how much this idiom of taking lock_guard as an argument is giving us. Is this a recommendation somewhere? :)
I was just paranoid about thread safety and followed this: https://www.justsoftwaresolutions.co.uk/threading/thread-safe-copy-constructors.html. Also, I just wanted copy constructors, but having the mutex member meant the compiler wouldn't auto-generate them, so I did the whole Google search on how to make my copy constructors thread safe.
I guess this is moot since we're killing all the copy constructors
aten/src/ATen/Context.h
So, eventually, are we going to put CPUGenerator into the generator registry? We certainly can't keep doing the current strategy (if-statements) when CUDA rolls around, because you won't be allowed to access that symbol from here.
So the problem with generator_registry is that CUDAGenerator will have its own std::vector of default generators, where the size of the std::vector is the number of GPUs in the machine, whereas the size of generator_registry is currently the number of backends in PyTorch. Hence, for the symbols, I was planning on doing the if statements and calling at::getCUDAHooks().getCUDAGenerator(device). Wouldn't that work?
That would work, I suppose, but in that case we might as well just give up on the generator registry and do this entirely through hooks.
aten/src/ATen/Utils.h
We can't assume RTTI is available on all platforms we support. You should add a tag to the Generator superclass and use that to test for the types in question.
nit: use C++ static_cast here
I haven't checked these algorithms yet
cc @cpuhrsch for codegen changes
This is looking pretty good, and a definite improvement from before! There is some C++ stuff that needs to be fixed first, however.
ezyang
left a comment
c++ stuff
Question: do we want to expose the …
Yep, Syed is planning on doing that in follow-up PRs.
gen->set_engine(engine_);
gen->set_next_float_normal_sample(next_float_normal_sample_);
gen->set_next_double_normal_sample(next_double_normal_sample_);
return gen;
I missed this on my first review, but could we have this return a std::unique_ptr<CloneableGenerator<CPUGenerator, Generator>> instead of a raw pointer that is manually memory managed? If you want to do this in a follow-up PR, that's OK (if this is the only problem, I'll initiate the land without a fix).
This pointer gets wrapped by a unique pointer: https://github.com/pytorch/pytorch/pull/16604/files#diff-621abddec4e628a1cd833e601c92751eR14 . It has to remain a raw pointer to support covariant unique_ptr return type in my understanding: https://www.fluentcpp.com/2017/09/12/how-to-return-a-smart-pointer-and-use-covariance/
Oh, I forgot about that. Disregard!
* in torch.get_rng_state() and torch.set_rng_state().
* It is a legacy class and even though it is replaced with
* at::CPUGenerator, we need this class and some of its fields
* to support backward compatibility on loading checkpoints.
I'm actually not sure we need to maintain BC here, but since you've done so, I'll take it. We can always break BC later.
Oh, I see, you've implemented a limited form of BC, where you can load old checkpoints, but saved checkpoints are incompatible with new versions.
Yes, I wanted to support loading of old checkpoints in set_rng_state, however, get_rng_state would return a THGeneratorNew as a byte tensor.
Re testing: the RNG state isn't that large, right? Can we just include a hex-encoded binary in the test, so that we can actually keep it in the test suite?
* surprises as mentioned in this article:
* http://www.pcg-random.org/posts/cpp-seeding-surprises.html
*/
What happened to this comment?
This comment will come back once @fritzo and @neerajprad get a chance to investigate why changing the initialization scheme of mt19937 breaks test_distributions.py.
bool seeded_;
uint32_t next_;
uint32_t state_[MERSENNE_STATE_N];
std::array<uint32_t, MERSENNE_STATE_N> state_;
I can think of plausible reasons for the changes here, but since there isn't commentary about it in comments anywhere, I'm left guessing. I think the changes are OK, but can you spell out why?
This is because of supporting BC in set_rng_state. Since THGeneratorState has all the members of the mt19937, I needed to copy in the raw array and the other members. Hence, I created a constructor, and std::array was the way to go to have it as a parameter in the constructor.
aten/src/TH/THGenerator.hpp
at::mt19937 engine;
c10::optional<float> next_float_normal_sample;
c10::optional<double> next_double_normal_sample;
}; (no newline at end of file)
Your editor is dropping trailing newlines again XD
aten/src/TH/THGenerator.hpp
struct THGeneratorStateNew {
at::mt19937 engine;
c10::optional<float> next_float_normal_sample;
c10::optional<double> next_double_normal_sample;
Unfortunately, c10::optional is not POD. You'll have to split them into separate float and bool fields.
Also, if you want to assert something is POD, best to do a static assert, e.g., static_assert(std::is_pod<Params>::value, "Params is not POD");
auto double_normal_sample = c10::optional<double>();
// Construct the state of at::CPUGenerator based on input byte tensor size.
if (THTensor_(nElement)(self) == size_legacy) {
This looks dangerous to me. If you ever modify THGeneratorStateNew so that its size coincides with THGeneratorState, you have a really nasty bug on your hands where we'll read in the RNG state in the wrong format.
I suppose I could be convinced that this situation is incredibly unlikely to happen, since it's not every day you're going to muck around with the RNG state. But at the very least, this warrants a static assert that sizeof(THGeneratorState) doesn't match sizeof(THGeneratorStateNew).
If you are serious about catching these errors, we need some sort of header that we can use to identify the format of the RNG state.
static const size_t size = sizeof(at::CPUGenerator);
at::CPUGenerator* rng_state;
THArgCheck(THTensor_(nElement)(self) == size, 1, "RNG state is wrong size");
auto cast_generator = static_cast<at::CPUGenerator*>(_generator);
I didn't catch this previously, but this now looks a little skeevy to me. Can you guarantee that we actually have a CPUGenerator at this point in time? If we can't guarantee it, you better have a check.
auto r = legacy_state->normal_rho;
auto theta = 2.0 * M_PI * legacy_state->normal_x;
// we return the sin version of the normal sample when in caching mode
double_normal_sample = c10::optional<double>(r * ::sin(theta));
Does this actually give you identical behavior to the old sampling behavior? I'm not convinced, because you only set double_normal_sample here, and not float_normal_sample, but I imagine the old state was utilized both for float and double sampling. (I don't think there is necessarily anything wrong with this, but when it comes to BC guarantees, we should call a spade a spade, and say what we are actually preserving, and what we are not. What we seem to have here is "best effort" BC--we won't crash, but the random numbers won't necessarily be the same.)
Yes, we would get identical behavior, since I actually haven't utilized the float versions that I'm storing, and have used at::normal_distribution<double> throughout (since that's what's being done in TH right now: THRandom_normal returns a double and is utilized for both float and double dispatch).
// update next_double_normal_sample
double_normal_sample = rng_state->next_double_normal_sample;
} else {
AT_ERROR("RNG state is wrong size");
It would be nice to put more diagnostic info here: how big was the RNG state, and what were the valid sizes?
I consider the POD problem and the lack of a static assert on size between THGeneratorState{,New} to be land blocking.
facebook-github-bot
left a comment
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ezyang I have addressed your latest review.
Regarding testing the old checkpoint: the RNG binary looks big: https://gist.github.com/syed-ahmed/fc2f3635a02cba12342c204d54754960#gistcomment-2897557
OK, let's not :)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
facebook-github-bot
left a comment
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Hey @syed-ahmed, here's an update on the status of landing this PR: the diff itself is passing PyTorch/Caffe2 tests, but there are some C++ BC-breaking changes which I will need to fix upstream. Unfortunately, I'll be on PTO next week, so I won't get to it until the week after. Additionally, the change in RNG has perturbed a few tests. I don't remember if this PR is supposed to produce exactly the same random numbers or not. We should probably fix these tests (esp. since you're planning to change the RNG engine), but if this was NOT expected for this diff, we should investigate more.
@syed-ahmed I think a good idea is to write "how to migrate your code" release notes (and if we can't write these migration notes in a reasonable way, we should add back some methods for BC). Things to cover: …
One thing that makes this tricky is that this PR only does CPU, and not CUDA. So I'm not sure how many BC problems being only partially refactored introduces.
This PR is supposed to produce exactly the same random numbers. One reason for the failing tests could be how I'm reading from /dev/random (if there are no …). Previously, we did this with https://github.com/pytorch/pytorch/blob/master/aten/src/TH/THRandom.cpp#L62, which just static_cast a 32-bit number from time(0), whereas I concatenate two 32-bit numbers in my function. I could either change getNonDeterministicRandom() to do the same static_cast and then you can run the internal CI again, or a solution for the failing test in the internal CI could be to set a manual_seed at the beginning of the test. I am looking into writing the migration notes and will ping again about the tricky part.
I'm OK with running this test (if you paste a diff in here I can apply it), but the explanation you've given me doesn't make sense to me. If the internal tests were previously seeding based on the current time, I would expect them to be similarly flaky (under the assumption that "bad" seeds are uniformly distributed among the space of 64-bit integers; I suppose it's possible that they're actually not distributed this way, but that still seems pretty weird). Thank you for the migration notes; they will also help me fix the internal use sites :) (Or they might also suggest to you that we can improve BC by reintroducing some old functions but deprecating them.)
I did a quick check to see if we were generating the same random numbers on master and on your PR. So, at least normal generation looks like it's fine. (Could it be a problem in one of the other RNG generators? Hard to say...)
This PR originated from #13070.
Summary:
The purpose of this PR is to refactor the Random Number Generator (RNG) design in ATen. Currently, RNGs in PyTorch have an asymmetrical design, i.e. CPU generators use an ATen class, whereas CUDA generators use legacy THC code (THCRNGState, THCState, THCRandom_Init, etc.). Moreover, the concept of generators in ATen isn't clear from its current design. This PR is the first part of a larger refactoring effort surrounding RNGs and only handles the PyTorch front-end and CPU backend for now. It does the following: … function_wrapper.py etc. … the torch.Generator api.