[CPU] Refactor Random Number Generators in ATen - Part 1/n #16604
Conversation
This isn't WIP anymore, right? Better remove the …
aten/src/ATen/core/Generator.h
Nice docs! Thank you!
This class is CPU only in this PR, is that right?
the torch.Generator() api is CPU-only in this PR (it's CPU-only in master as well):
https://github.com/pytorch/pytorch/blob/2b826c6171608bebd96be6dd1f8201ab6a0a8dff/torch/csrc/Generator.cpp#L47-L69
In the second PR, I want to be able to create a CUDAGenerator when I pass device="cuda" to torch.Generator().
aten/src/ATen/core/Generator.h
OOC, where did we get this number from?
So since the Philox engine has two 32-bit keys (that is, its state is 64 bits, I think), it is recommended that an RNG engine be seeded with the same number of bits as its state (I think I read this in one of the links I posted in the last PR). Also, the seed value should have a good distribution of 0s and 1s in its bit representation. So I chose a large prime number that fits into uint64_t: https://primes.utm.edu/curios/page.php/67280421310721.html. You'll notice another number in:
https://github.com/pytorch/pytorch/blob/2b826c6171608bebd96be6dd1f8201ab6a0a8dff/test/common_utils.py#L46-L48
The Python side won't let me seed with 64 bits, so I just chose a large prime there under 32 bits.
¯\_(ツ)_/¯
aten/src/ATen/core/Generator.h
This copy constructor seems pretty dangerous, because you're planning to subclass this class, and add fields to the subclasses. When you allow copy construction of classes with subclasses, you are exposing yourself to the possibility of object slicing (https://stackoverflow.com/questions/274626/what-is-object-slicing). It is probably best if we don't allow copy constructor at all, and instead have a virtual method for doing clones.
So, we should disable all copy/move constructors/assignment on the superclass Generator.
CC @smessmer
It's not an object slicing issue as long as you take other by reference, but disabling copy constructors where they're not needed is always a good idea.
aten/src/ATen/core/Generator.h
nit: mutex_
aten/src/ATen/core/Generator.cpp
This looks unnecessary. I'd be very happy to say that device_ is immutable and therefore does not need to be locked.
aten/src/ATen/core/Generator.h
I think we need to think more clearly about what thread-safety guarantees we want this class to give. This is important because you've made device_ and current_seed_ protected variables, which means that people who write subclasses can access these directly (without going through the locked getter); if you don't give clear guidance about thread safety expectations, they will probably get it wrong.
My understanding (which could be wrong) is that the most important thing to make thread safe is the random number generator state itself; changes to device or current seed don't actually happen concurrently from different threads.
That is an important point about protected variables that I completely overlooked! You are right that the most important thing to lock is the RNG. I see a pattern where people like to call torch.manual_seed() before they use their rand functions, so I wondered whether that would raise thread-safety issues if I didn't lock the getters and setters as well.
I feel like manual seed is usually called at the beginning of the program, or from a single-threaded context; that's why I hypothesize it's OK not to lock. Although it's going to set up the RNG state, which we should lock just for good measure, so maybe not.
TBH, the fact that we are sharing RNG state across threads is kind of crazy, it really should be a thread local concept. But I agree this is consistent with what we did previously, so...
aten/src/ATen/core/Generator.cpp
Do we ever instantiate a Generator (and not a subclass)? If not, we should just declare this pure virtual and remove the stub immediately. If that's not possible, I'd prefer if this raised an error if it was called at runtime, rather than just ignore the seed parameter.
I would also support making Generator pure virtual, but got confused when all the CPU and CUDA methods (manualSeedAll) were put in there (maybe people didn't want to do a dynamic_cast and call the subclass's specialized method directly from a Generator*/Generator&?). Either way, I didn't see an instance where we create a Generator object; we just pass Generator* in function signatures, which is then dynamic_cast'ed to the specific backend the code needs.
Fortunately for you, this is in return position, and C++ supports covariant return types. This means you can override the method with a refined return type (CPUGenerator) and it is still a valid override.
aten/src/ATen/CPUGenerator.h
nit: Just drop the detail block below the class, then you don't need a forward declaration anymore
aten/src/ATen/CPUGenerator.h
This looks wrong, you shouldn't declare variables in headers. Move these to the cpp file.
aten/src/ATen/CPUGenerator.h
Are these really internal? You call them from a lot of places... what's the normal way to get one of these generators?
So ideally I wanted to do the following from Context.h:
auto getDefaultGenerator(at::CPU) -> gives me CPUGenerator
auto getDefaultGenerator(at::CUDA) -> gives me CUDAGenerator
auto getDefaultGenerator(at::device({at::kCUDA, 1})) -> gives me CUDAGenerator at device index 1
... same for createGenerator
I thought I could hack this behavior using SFINAE (similar to what I did in the PhiloxRNGEngine's operator() call) but ended up giving up on it after not being able to fix the template errors. So now I am creating this API inside these subclasses, such that when somebody needs a CPUGenerator, they can just call this instead of getDefaultGenerator, which returns a Generator&. All this to avoid having to dynamic_cast all the time in the backend when needing specialized class methods (such as the random() and random64() methods from CPUGenerator).
This sounds fine, maybe they shouldn't be internal ;)
aten/src/ATen/CPUGenerator.cpp
We should resolve the question of thread-safety first (though I agree that copies should be mutex protected), but I do wonder about how much this idiom of taking lock_guard as an argument is giving us. Is this a recommendation somewhere? :)
I was just paranoid about thread safety and followed this: https://www.justsoftwaresolutions.co.uk/threading/thread-safe-copy-constructors.html. Also, I just wanted copy constructors, but having the mutex member meant the compiler wouldn't auto-generate them, so I did the whole Google search on how to make my copy constructors thread safe.
I guess this is moot since we're killing all the copy constructors
aten/src/ATen/Context.h
So, eventually, are we going to put CPUGenerator into the generator registry? We certainly can't keep doing the current strategy (if-statements) when CUDA rolls around, because you won't be allowed to access that symbol from here.
So the problem with generator_registry is that CUDAGenerator will have its own std::vector of default generators, where the size of the std::vector is the number of GPUs in the machine, whereas the size of generator_registry is currently the number of backends in PyTorch. Hence, for the symbols, I was planning on doing the if statements and calling at::getCUDAHooks().getCUDAGenerator(device). Wouldn't that work?
That would work, I suppose, but in that case we might as well just give up on the generator registry and do this entirely through hooks.
aten/src/ATen/Utils.h
We can't assume RTTI is available on all platforms we support. You should add a tag to the Generator superclass and use that to test for the types in question.
nit: use C++ static_cast here
I haven't checked these algorithms yet
cc @cpuhrsch for codegen changes
This is looking pretty good, and a definite improvement from before! There is some C++ stuff that needs to be fixed first, however.
ezyang
left a comment
c++ stuff
Question: do we want to expose the …
Yep, Syed is planning on doing that in follow-up PRs.
gen->set_engine(engine_);
gen->set_next_float_normal_sample(next_float_normal_sample_);
gen->set_next_double_normal_sample(next_double_normal_sample_);
return gen;
I missed this on my first review, but could we have this return a std::unique_ptr<CloneableGenerator<CPUGenerator, Generator>> instead of a raw pointer that is manually memory managed? If you want to do this in a follow-up PR, that's OK (if this is the only problem, I'll initiate the land without a fix).
This pointer gets wrapped by a unique pointer: https://github.com/pytorch/pytorch/pull/16604/files#diff-621abddec4e628a1cd833e601c92751eR14 . It has to remain a raw pointer to support covariant unique_ptr return type in my understanding: https://www.fluentcpp.com/2017/09/12/how-to-return-a-smart-pointer-and-use-covariance/
Oh, I forgot about that. Disregard!
* in torch.get_rng_state() and torch.set_rng_state().
* It is a legacy class and even though it is replaced with
* at::CPUGenerator, we need this class and some of its fields
* to support backward compatibility on loading checkpoints.
I'm actually not sure we need to maintain BC here, but since you've done so, I'll take it. We can always break BC later.
Oh, I see, you've implemented a limited form of BC, where you can load old checkpoints, but saved checkpoints are incompatible with new versions.
Yes, I wanted to support loading of old checkpoints in set_rng_state, however, get_rng_state would return a THGeneratorNew as a byte tensor.
Re testing: the RNG state isn't that large, right? Can we just include a hex-encoded binary in the test, so that we can actually keep it in the test suite?
* surprises as mentioned in this article:
* http://www.pcg-random.org/posts/cpp-seeding-surprises.html
*/
What happened to this comment?
This comment will come back once @fritzo and @neerajprad get a chance to investigate why changing the initialization scheme of mt19937 breaks test_distributions.py.
bool seeded_;
uint32_t next_;
uint32_t state_[MERSENNE_STATE_N];
std::array<uint32_t, MERSENNE_STATE_N> state_;
I can think of plausible reasons for the changes here, but since there isn't commentary about it in comments anywhere, I'm left guessing. I think the changes are OK, but can you spell out why?
This is because of supporting BC in set_rng_state. Since THGeneratorState has all the members of the mt19937, I needed to copy in the raw array and the other members. Hence, I created a constructor, and std::array was the way to go to have it as a parameter in the constructor.
aten/src/TH/THGenerator.hpp
at::mt19937 engine;
c10::optional<float> next_float_normal_sample;
c10::optional<double> next_double_normal_sample;
}; (no newline at end of file)
Your editor is dropping trailing newlines again XD
aten/src/TH/THGenerator.hpp
struct THGeneratorStateNew {
at::mt19937 engine;
c10::optional<float> next_float_normal_sample;
c10::optional<double> next_double_normal_sample;
Unfortunately, c10::optional is not POD. You'll have to split them into separate float and bool fields.
Also, if you want to assert something is POD, best to do a static assert, e.g., static_assert(std::is_pod<Params>::value, "Params is not POD");
auto double_normal_sample = c10::optional<double>();
// Construct the state of at::CPUGenerator based on input byte tensor size.
if (THTensor_(nElement)(self) == size_legacy) {
This looks dangerous to me. If you ever modify THGeneratorStateNew so that its size coincides with THGeneratorState, you have a really nasty bug on your hands where we'll read in the RNG state in the wrong format.
I suppose I could be convinced that this situation is incredibly unlikely to happen, since it's not every day you're going to muck around with the RNG state. But at the very least, this warrants a static assert that sizeof(THGeneratorState) doesn't match sizeof(THGeneratorStateNew).
If you are serious about catching these errors, we need some sort of header that we can use to identify the format of the RNG state.
static const size_t size = sizeof(at::CPUGenerator);
at::CPUGenerator* rng_state;
THArgCheck(THTensor_(nElement)(self) == size, 1, "RNG state is wrong size");
auto cast_generator = static_cast<at::CPUGenerator*>(_generator);
I didn't catch this previously, but this now looks a little skeevy to me. Can you guarantee that we actually have a CPUGenerator at this point in time? If we can't guarantee it, you better have a check.
auto r = legacy_state->normal_rho;
auto theta = 2.0 * M_PI * legacy_state->normal_x;
// we return the sin version of the normal sample when in caching mode
double_normal_sample = c10::optional<double>(r * ::sin(theta));
Does this actually give you identical behavior to the old sampling behavior? I'm not convinced, because you only set double_normal_sample here, and not float_normal_sample, but I imagine the old state was utilized both for float and double sampling. (I don't think there is necessarily anything wrong with this, but when it comes to BC guarantees, we should call a spade a spade, and say what we are actually preserving, and what we are not. What we seem to have here is "best effort" BC--we won't crash, but the random numbers won't necessarily be the same.)
Yes, we would get identical behavior, since I actually haven't utilized the float versions that I'm storing, and have used at::normal_distribution<double> throughout (since that's what's being done in TH right now: THRandom_normal returns a double and is utilized for both float and double dispatch).
// update next_double_normal_sample
double_normal_sample = rng_state->next_double_normal_sample;
} else {
AT_ERROR("RNG state is wrong size");
It would be nice to put more diagnostic info here: how big was the RNG state, and what were the valid sizes?
I consider the POD problem and the lack of a static assert on size between THGeneratorState{,New} to be land blocking.
facebook-github-bot
left a comment
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ezyang I have addressed your latest review.
Regarding testing the old checkpoint: the RNG binary looks big: https://gist.github.com/syed-ahmed/fc2f3635a02cba12342c204d54754960#gistcomment-2897557
OK, let's not :)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
facebook-github-bot
left a comment
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Hey @syed-ahmed, here's an update on the status of landing this PR: the diff itself is passing PyTorch/Caffe2 tests, but there are some C++ BC-breaking changes which I will need to fix upstream. Unfortunately, I'll be on PTO next week, so I won't get to it until the week after. Additionally, the change in RNG has perturbed a few tests. I don't remember if this PR is supposed to produce exactly the same random numbers or not. We should probably fix these tests (esp. since you're planning to change the RNG engine), but if this was NOT expected for this diff, we should investigate more.
@syed-ahmed I think a good idea is to write "how to migrate your code" release notes (and if we can't write these migration notes in a reasonable way, we should add back some methods for BC). Things to cover: …
One thing that makes this tricky is that this PR only does CPU, and not CUDA. So I'm not sure how many BC problems being only partially refactored introduces.
This PR is supposed to produce exactly the same random numbers. One reason for the failing tests could be how I'm reading from /dev/random (if there are no …). Previously, we did this with https://github.com/pytorch/pytorch/blob/master/aten/src/TH/THRandom.cpp#L62, which just static_cast a 32-bit number from time(0), whereas I concatenate two 32-bit numbers in my function. I could either change getNonDeterministicRandom() to do the same static_cast and then you can run the internal CI again, or a solution for the failing test in the internal CI could be to set a manual_seed at the beginning of the test. I am looking into writing the migration notes and will ping again about the tricky part.
I'm OK with running this test (if you paste a diff in here I can apply it), but the explanation you've given me doesn't make sense to me. If the internal tests were previously seeding based on the current time, I would expect them to be similarly flaky (under the assumption that "bad" seeds are uniformly distributed among the space of 64-bit integers; I suppose it's possible that they're actually not distributed this way, but that still seems pretty weird). Thank you for the migration notes; they will also help me fix the internal use sites :) (Or they might also suggest to you that we can improve BC by reintroducing some old functions but deprecating them.)
I did a quick check to see if we were generating the same random numbers on master and on your PR. So, at least normal generation looks like it's fine. (Could it be a problem in one of the other RNG generators? Hard to say...)
This PR originated from #13070.
Summary:
The purpose of this PR is to refactor the Random Number Generator (RNG) design in ATen. Currently, RNGs in PyTorch have an asymmetrical design, i.e. CPU generators use an ATen class, whereas CUDA generators use legacy THC code (THCRNGState, THCState, THCRandom_Init, etc.). Moreover, the concept of generators in ATen isn't clear from its current design. This PR is the first part of a larger refactoring effort surrounding RNGs and only handles the PyTorch front-end and CPU backend for now. It does the following: … function_wrapper.py etc. … the torch.Generator api.