
Conversation

@syed-ahmed (Collaborator) commented Jan 31, 2019

This PR originated from #13070.

Summary:

The purpose of this PR is to refactor the Random Number Generator (RNG) design in ATen. Currently, RNGs in PyTorch have an asymmetrical design: CPU generators use an ATen class, whereas CUDA generators use legacy THC code (THCRNGState, THCState, THCRandom_Init, etc.). Moreover, the concept of a generator in ATen isn't clear from the current design. This PR is the first part of a broader refactoring effort around RNGs and only handles the PyTorch front end and CPU backend for now. It does the following:

  • Clarifies generator concepts by revising the Generator, CPUGenerator and CUDAGenerator classes.
  • Moves mt19937 from TH to ATen as MT19937RNGEngine.h and moves the distributions from THRandom.cpp to DistributionsHelper.h. Adds the PhiloxRNGEngine.h engine and unit tests for it.
  • Fixes hardcoded generator-related code in several Python files used for code generation, such as function_wrapper.py.
  • Fixes the generator front-end Python bindings to include the device kwarg and the default kwarg.
  • Removes creation of generators from Types.
  • Updates documentation and comments and adds documentation for the torch.Generator API.

@ezyang (Contributor) commented Feb 1, 2019

This isn't WIP anymore, right? Better remove the [WIP] tag in that case ;)

Contributor

Nice docs! Thank you!

Contributor

This class is CPU only in this PR, is that right?

Collaborator Author

The torch.Generator() API is CPU-only in this PR (it's CPU-only in master as well):
https://github.com/pytorch/pytorch/blob/2b826c6171608bebd96be6dd1f8201ab6a0a8dff/torch/csrc/Generator.cpp#L47-L69
In the second PR, I want to be able to create a CUDAGenerator when device="cuda" is passed to torch.Generator().

Contributor

OOC, where did we get this number from?

Collaborator Author

Since the Philox engine has two 32-bit keys (that is, its state is 64 bits, I think?), it is recommended that an RNG engine be seeded with the same number of bits as its state (I think I read this in one of the links I posted in the last PR). Also, the seed value should have a good distribution of 0s and 1s in its bit representation. So I chose a large prime number that fits into a uint64_t: https://primes.utm.edu/curios/page.php/67280421310721.html. You'll notice another number in:
https://github.com/pytorch/pytorch/blob/2b826c6171608bebd96be6dd1f8201ab6a0a8dff/test/common_utils.py#L46-L48
The Python side won't let me seed with a 64-bit value, so I just chose a large prime there that fits in 32 bits.
¯\_(ツ)_/¯

Contributor

This copy constructor seems pretty dangerous, because you're planning to subclass this class and add fields to the subclasses. When you allow copy construction of classes with subclasses, you are exposing yourself to the possibility of object slicing (https://stackoverflow.com/questions/274626/what-is-object-slicing). It is probably best if we don't allow copy construction at all, and instead have a virtual method for doing clones.

@ezyang (Contributor) Feb 1, 2019

So, we should disable all copy/move constructors/assignment on the superclass Generator.

CC @smessmer
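
A minimal sketch of that shape, with hypothetical field names rather than the actual ATen code:

#include <cstdint>
#include <memory>

struct Generator {
  Generator() = default;
  virtual ~Generator() = default;

  // No copy/move through the base class: this is what prevents slicing when
  // subclasses (CPUGenerator, CUDAGenerator) add their own state.
  Generator(const Generator&) = delete;
  Generator& operator=(const Generator&) = delete;
  Generator(Generator&&) = delete;
  Generator& operator=(Generator&&) = delete;

  // Copies go through a virtual clone instead, preserving the dynamic type.
  virtual std::unique_ptr<Generator> clone() const = 0;
};

struct CPUGenerator : Generator {
  uint64_t seed_ = 0;  // hypothetical subclass state
  std::unique_ptr<Generator> clone() const override {
    auto copy = std::make_unique<CPUGenerator>();
    copy->seed_ = seed_;  // copy subclass state explicitly
    return copy;
  }
};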

Contributor

It's not an object slicing issue as long as you take the other object by reference, but disabling copy constructors where they're not needed is always a good idea.

Contributor

nit: mutex_

Contributor

This looks unnecessary. I'd be very happy to say that device_ is immutable and therefore does not need to be locked.

Contributor

I think we need to think more clearly about what thread-safety guarantees we want this class to give. This is important because you've made device_ and current_seed_ protected variables, which means that people who write subclasses can access these directly (without going through the locked getter); if you don't give clear guidance about thread safety expectations, they will probably get it wrong.

My understanding (which could be wrong) is that the most important thing to make thread safe is the random number generator state itself; changes to device or current seed don't actually happen concurrently from different threads.

@syed-ahmed (Collaborator Author) Feb 2, 2019

That is an important point about protected variables that I completely overlooked! You are right that the most important thing to lock is the RNG. I see a pattern where people like to call torch.manual_seed() before they use their rand functions, so I thought that might raise thread-safety issues if I don't lock the getters and setters as well.

Contributor

I feel like manual seeding is usually done at the beginning of the program, or from a single-threaded context, so that's why I hypothesize it's OK not to lock. Although it does set up the RNG state, which we should lock just for good measure, so maybe not.

TBH, the fact that we are sharing RNG state across threads is kind of crazy; it really should be a thread-local concept. But I agree this is consistent with what we did previously, so...

Contributor

Do we ever instantiate a Generator (and not a subclass)? If not, we should just declare this pure virtual and remove the stub immediately. If that's not possible, I'd prefer that this raise an error when called at runtime, rather than just ignoring the seed parameter.

Collaborator Author

I would also support making Generator pure virtual, but I got confused because all the CPU and CUDA methods (manualSeedAll) were put in there (maybe people didn't want to dynamic_cast and instead wanted to call the subclass's specialized methods directly from a Generator*/Generator&?). Either way, I didn't see an instance where we create a Generator object; we just pass Generator* in function signatures, which is then dynamic_cast to the specific backend the code needs.

Contributor

Fortunately for you, this is in return position, and C++ supports covariant return types. This means you can override the method with a refined return type (CPUGenerator) and it is still a valid override.
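
A tiny illustration of the covariance rule, with made-up names:

struct Generator {
  virtual ~Generator() = default;
  virtual Generator* clone_impl() const = 0;
};

struct CPUGenerator : Generator {
  // Valid override: CPUGenerator* is covariant with Generator*, so callers
  // that know they have a CPUGenerator get the refined type back without a cast.
  CPUGenerator* clone_impl() const override {
    return new CPUGenerator(*this);
  }
};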

Contributor

nit: Just drop the detail block below the class, then you don't need a forward declaration anymore

Contributor

This looks wrong, you shouldn't declare variables in headers. Move these to the cpp file.

Contributor

Are these really internal? You call them from a lot of places... what's the normal way to get one of these generators?

@syed-ahmed (Collaborator Author) Feb 2, 2019

So ideally I wanted to do the following from Context.h:

auto getDefaultGenerator(at::CPU) -> gives me CPUGenerator
auto getDefaultGenerator(at::CUDA) -> gives me CUDAGenerator
auto getDefaultGenerator(at::device({at::kCUDA, 1}) -> gives me CUDAGenerator at device index 1
... same for createGenerator

I thought I could hack this behavior together using SFINAE (similar to what I did in PhiloxRNGEngine's operator() call) but ended up giving up after not being able to fix the template errors. So now I am creating this API inside the subclasses, so that when somebody needs a CPUGenerator they can just call this instead of getDefaultGenerator, which returns a Generator&. All this to avoid having to dynamic_cast all the time in the backend when specialized class methods are needed (such as the random() and random64() methods of CPUGenerator).
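
For what it's worth, a plain-overload sketch of the shape described above (hypothetical tag types and storage, not the actual Context API):

#include <cstdint>

struct Generator { virtual ~Generator() = default; };
struct CPUGenerator : Generator { uint32_t random(); uint64_t random64(); };
struct CUDAGenerator : Generator {};

// Hypothetical stand-ins for at::CPU / at::Device({at::kCUDA, 1}) above.
struct CPUTag {};
struct CUDATag { int device_index = 0; };

// One overload per backend, each returning the concrete type, so callers that
// need CPU-specific methods (random(), random64()) never have to dynamic_cast.
CPUGenerator& getDefaultGenerator(CPUTag) {
  static CPUGenerator gen;
  return gen;
}
CUDAGenerator& getDefaultGenerator(CUDATag tag) {
  static CUDAGenerator gens[16];  // one default generator per device
  return gens[tag.device_index];
}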

Contributor

This sounds fine, maybe they shouldn't be internal ;)

Contributor

We should resolve the question of thread-safety first (though I agree that copies should be mutex protected), but I do wonder about how much this idiom of taking lock_guard as an argument is giving us. Is this a recommendation somewhere? :)

Collaborator Author

I was just paranoid about thread safety and followed this: https://www.justsoftwaresolutions.co.uk/threading/thread-safe-copy-constructors.html. Also, I just wanted copy constructors, but having the mutex as a member means the compiler won't auto-generate them, so I went down the Google-search rabbit hole of making my copy constructors thread safe.

Contributor

I guess this is moot since we're killing all the copy constructors

Contributor

So, eventually, are we going to put CPUGenerator into the generator registry? We certainly can't keep doing the current strategy (if-statements) when CUDA rolls around, because you won't be allowed to access that symbol from here.

Collaborator Author

So the problem with generator_registry is that CUDAGenerator will have its own std::vector of default generators, where the size of the std::vector is the number of GPUs in the machine, whereas currently the size of generator_registry is the number of backends in PyTorch. Hence, for the symbols, I was planning on doing the if statements and calling at::getCUDAHooks().getCUDAGenerator(device). Wouldn't that work?

Contributor

That would work, I suppose, but in that case we might as well just give up on the generator registry and do this entirely through hooks.

Contributor

We can't assume RTTI is available on all platforms we support. You should add a tag to the Generator superclass and use that to test for the types in question.
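
A minimal sketch of the tag-based check, with hypothetical names:

#include <stdexcept>

enum class GeneratorDeviceType { CPU, CUDA };

struct Generator {
  explicit Generator(GeneratorDeviceType type) : device_type_(type) {}
  virtual ~Generator() = default;
  GeneratorDeviceType device_type() const { return device_type_; }
 private:
  GeneratorDeviceType device_type_;
};

struct CPUGenerator : Generator {
  CPUGenerator() : Generator(GeneratorDeviceType::CPU) {}
};

// Works without RTTI: check the tag stored in the base class, then static_cast.
CPUGenerator& check_generator_is_cpu(Generator& gen) {
  if (gen.device_type() != GeneratorDeviceType::CPU) {
    throw std::runtime_error("Expected a CPUGenerator");
  }
  return static_cast<CPUGenerator&>(gen);
}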

Contributor

nit: use C++ static_cast here

Contributor

I haven't checked these algorithms yet

@ezyang (Contributor) commented Feb 1, 2019

cc @cpuhrsch for codegen changes

@ezyang (Contributor) commented Feb 1, 2019

This is looking pretty good, and a definite improvement from before! There is some C++ stuff that needs to be fixed first however.

@ezyang (Contributor) left a comment

C++ stuff

@syed-ahmed syed-ahmed changed the title [WIP][CPU] Refactor Random Number Generators in ATen [CPU] Refactor Random Number Generators in ATen Feb 1, 2019
@syed-ahmed syed-ahmed changed the title [CPU] Refactor Random Number Generators in ATen [CPU] Refactor Random Number Generators in ATen - Part 1/2 Feb 1, 2019
@fmassa (Member) commented Feb 4, 2019

Question: do we want to expose the CUDAGenerator to Python as well?
For context: #15867

@ezyang (Contributor) commented Feb 4, 2019

Yep, Syed is planning on doing that in follow up PRs.

@cpuhrsch cpuhrsch self-requested a review February 14, 2019 23:10
@syed-ahmed syed-ahmed force-pushed the generator-refactor-cpu branch 4 times, most recently from 7daf7b1 to 1d6980e Compare February 27, 2019 18:36
@syed-ahmed syed-ahmed requested a review from ezyang April 24, 2019 05:30
gen->set_engine(engine_);
gen->set_next_float_normal_sample(next_float_normal_sample_);
gen->set_next_double_normal_sample(next_double_normal_sample_);
return gen;
Contributor

I missed this on my first review, but could we have this return a std::unique_ptr<CloneableGenerator<CPUGenerator, Generator>> instead of a raw pointer that is manually memory managed? If you want to do this in a follow-up PR, that's OK (if this is the only problem, I'll initiate the land without a fix).

@syed-ahmed (Collaborator Author) Apr 24, 2019

This pointer gets wrapped by a unique pointer: https://github.com/pytorch/pytorch/pull/16604/files#diff-621abddec4e628a1cd833e601c92751eR14 . It has to remain a raw pointer to support a covariant return type with unique_ptr, in my understanding: https://www.fluentcpp.com/2017/09/12/how-to-return-a-smart-pointer-and-use-covariance/
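
For readers not following the link, the idea is roughly to keep the virtual clone returning a raw pointer (so covariance applies) and wrap it in a non-virtual method that returns the unique_ptr. A rough sketch, not the actual CloneableGenerator code:

#include <memory>

struct Generator {
  virtual ~Generator() = default;

  // Public, non-virtual: hands the caller an owning pointer.
  std::unique_ptr<Generator> clone() const {
    return std::unique_ptr<Generator>(clone_impl());
  }

 protected:
  // Virtual and raw: unique_ptr<CPUGenerator> is not covariant with
  // unique_ptr<Generator>, but CPUGenerator* is covariant with Generator*.
  virtual Generator* clone_impl() const = 0;
};

struct CPUGenerator : Generator {
  std::unique_ptr<CPUGenerator> clone() const {  // shadows the base clone()
    return std::unique_ptr<CPUGenerator>(clone_impl());
  }

 protected:
  CPUGenerator* clone_impl() const override {
    return new CPUGenerator(*this);
  }
};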

Contributor

Oh, I forgot about that. Disregard!

* in torch.get_rng_state() and torch.set_rng_state().
* It is a legacy class and even though it is replaced with
* at::CPUGenerator, we need this class and some of its fields
* to support backward compatibility on loading checkpoints.
Contributor

I'm actually not sure we need to maintain BC here, but since you've done so, I'll take it. We can always break BC later.

Contributor

Oh, I see, you've implemented a limited form of BC, where you can load old checkpoints, but saved checkpoints are incompatible with new versions.

Collaborator Author

Yes, I wanted to support loading old checkpoints in set_rng_state; however, get_rng_state would return a THGeneratorStateNew as a byte tensor.

@ezyang (Contributor) commented Apr 24, 2019

Re testing: the RNG state isn't that large, right? Can we just include a hex-encoded binary in the test, so that we can actually keep it in the test suite?

* surprises as mentioned in this article:
* http://www.pcg-random.org/posts/cpp-seeding-surprises.html
*/

Contributor

What happened to this comment?

Collaborator Author

This comment will come back once @fritzo and @neerajprad get a chance to investigate why changing the initialization scheme of mt19937 breaks test_distributions.py.

bool seeded_;
uint32_t next_;
uint32_t state_[MERSENNE_STATE_N];
std::array<uint32_t, MERSENNE_STATE_N> state_;
Contributor

I can think of plausible reasons for the changes here, but since there isn't commentary about it in comments anywhere, I'm left guessing. I think the changes are OK, but can you spell out why?

Collaborator Author

This is because of supporting BC in set_rng_state. Since THGeneratorState has all the members of the mt19937, I needed to copy in the raw array and the other members. Hence, I created a constructor, and std::array was the way to go to have the state as a constructor parameter.

at::mt19937 engine;
c10::optional<float> next_float_normal_sample;
c10::optional<double> next_double_normal_sample;
};
Contributor

Your editor is dropping trailing newlines again XD

struct THGeneratorStateNew {
at::mt19937 engine;
c10::optional<float> next_float_normal_sample;
c10::optional<double> next_double_normal_sample;
Contributor

Unfortunately, c10::optional is not POD. You'll have to split them into separate float and bool fields.

Also, if you want to assert something is POD, best to do a static assert, e.g., static_assert(std::is_pod<Params>::value, "Params is not POD");
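
Something along these lines, with hypothetical field names, just to illustrate the split and the assert:

#include <type_traits>

struct NormalSampleCache {
  double next_double_normal_sample;
  bool   is_next_double_normal_sample_valid;
  float  next_float_normal_sample;
  bool   is_next_float_normal_sample_valid;
};

static_assert(std::is_pod<NormalSampleCache>::value,
              "NormalSampleCache is not a POD type");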

auto double_normal_sample = c10::optional<double>();

// Construct the state of at::CPUGenerator based on input byte tensor size.
if (THTensor_(nElement)(self) == size_legacy) {
Contributor

This looks dangerous to me. If you ever modify THGeneratorStateNew so that its size coincides with THGeneratorState, you have a really nasty bug on your hands where we'll read in the RNG state from the wrong layout.

I suppose I could be convinced that this situation is incredibly unlikely to happen, since it's not every day that you're going to muck around with the RNG state. But at the very least, this warrants a static assert that sizeof(THGeneratorState) doesn't match sizeof(THGeneratorStateNew).
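
That is, something like the following (assuming both struct definitions are visible at this point):

// Guards against the legacy and new layouts ever having the same size, since
// set_rng_state distinguishes them by the byte count of the input tensor.
static_assert(sizeof(THGeneratorState) != sizeof(THGeneratorStateNew),
              "the legacy and new RNG state structs must not have the same size");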

Contributor

If you are serious about catching these errors, we need some sort of header that we can use to identify the format of the RNG state.

static const size_t size = sizeof(at::CPUGenerator);
at::CPUGenerator* rng_state;
THArgCheck(THTensor_(nElement)(self) == size, 1, "RNG state is wrong size");
auto cast_generator = static_cast<at::CPUGenerator*>(_generator);
Contributor

I didn't catch this previously, but this now looks a little skeevy to me. Can you guarantee that we actually have a CPUGenerator at this point in time? If we can't guarantee it, you better have a check.

auto r = legacy_state->normal_rho;
auto theta = 2.0 * M_PI * legacy_state->normal_x;
// we return the sin version of the normal sample when in caching mode
double_normal_sample = c10::optional<double>(r * ::sin(theta));
Contributor

Does this actually give you identical behavior to the old sampling behavior? I'm not convinced, because you only set double_normal_sample here, and not float_normal_sample, but I imagine the old state was utilized both for float and double sampling. (I don't think there is necessarily anything wrong with this, but when it comes to BC guarantees, we should call a spade a spade and say what we are actually preserving and what we are not. What we seem to have here is "best effort" BC: we won't crash, but the random numbers won't necessarily be the same.)
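
(For context, the Box-Muller path produces normals in pairs and caches one of them; a rough standalone sketch of the caching scheme, not the TH code itself:)

#include <cmath>
#include <random>

// Box-Muller: two uniforms yield two independent standard normals. One is
// returned, and the other (here the sin term, as in the snippet above) is
// cached for the next call instead of being thrown away.
double sample_normal(std::mt19937_64& rng, double& cache, bool& cache_valid) {
  if (cache_valid) {
    cache_valid = false;
    return cache;
  }
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  const double two_pi = 2.0 * 3.14159265358979323846;
  double u1 = uniform(rng);
  double u2 = uniform(rng);
  double r = std::sqrt(-2.0 * std::log(1.0 - u1));
  double theta = two_pi * u2;
  cache = r * std::sin(theta);
  cache_valid = true;
  return r * std::cos(theta);
}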

Collaborator Author

Yes, we would get identical behavior, since I actually haven't utilized the float versions that I'm storing and have used at::normal_distribution<double> throughout (since that's what's being done in TH right now: THRandom_normal returns a double and is used for both the float and double dispatch).

// update next_double_normal_sample
double_normal_sample = rng_state->next_double_normal_sample;
} else {
AT_ERROR("RNG state is wrong size");
Contributor

It would be nice to put more diagnostic info here: how big was the RNG state, and what were the valid sizes?

@ezyang (Contributor) commented Apr 24, 2019

I consider the POD problem and the lack of static assert on size between THGeneratorState{,New} to be land blocking.

@facebook-github-bot left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@syed-ahmed (Collaborator Author)

@ezyang I have addressed your latest review. The changes are:

  • Fixed the POD issues: added a mt19937_data_pod to get the mt19937 data in and out, and broke c10::optional into bool and float fields
  • Added static asserts on sizes (let's do the header solution in the future, when we can move get_rng_state and set_rng_state natively into ATen) and added diagnostics to the error messages
  • Added a check for CPUGenerator in get/set_rng_state.

Regarding testing against an old checkpoint: the RNG binary looks big: https://gist.github.com/syed-ahmed/fc2f3635a02cba12342c204d54754960#gistcomment-2897557
Let me know whether you still want this in the test suite.

@syed-ahmed syed-ahmed requested a review from ezyang April 24, 2019 23:43
@ezyang (Contributor) commented Apr 25, 2019

Regarding the testing of old checkpoint: the rng binary looks big: https://gist.github.com/syed-ahmed/fc2f3635a02cba12342c204d54754960#gistcomment-2897557
Let me know whether you still want this in the test suite.

OK, let's not :)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@ezyang ezyang dismissed their stale review April 25, 2019 13:28

fixed

@facebook-github-bot left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Apr 26, 2019

Hey @syed-ahmed, here's an update on the status of landing this PR: the diff itself is passing PyTorch/Caffe2 tests, but there are some C++ BC-breaking changes which I will need to fix upstream. Unfortunately, I'll be on PTO next week, so I won't get to it until the week after.

TensorImpl.cpp:44:28: error: ‘class at::Context’ has no member named ‘defaultGenerator’
     torch::globalContext().defaultGenerator(torch::kCPU);
                            ^~~~~~~~~~~~~~~~
TensorImpl.cpp:45:11: error: ‘struct at::Generator’ has no member named ‘manualSeed’; did you mean ‘manualSeedAll’?
   cpu_gen.manualSeed(seed);
           ^~~~~~~~~~

Additionally, the change in RNG has perturbed a few tests. I don't remember if this PR is supposed to produce exactly the same random numbers or not. We should probably fix these tests (esp. since you're planning to change the RNG engine), but if this was NOT expected for this diff, we should investigate more.

tensor1 (torch.Size([4, 4])) and tensor2 (torch.Size([4, 4])) are not close enough. 
max rtol: 0.39336044	max atol: 0.30065852

and

    self.assertAlmostEqual(output, -25.48, places=2)
AssertionError: -26.65626863950947 != -25.48 within 2 places

@ezyang (Contributor) commented Apr 26, 2019

@syed-ahmed I think a good idea is to write "how to migrate your code" release notes (and if we can't write these migration notes in a reasonable way, we should add back some methods for BC). Things to cover:

  • the torch::Generator type is different now, right? If you previously wrote torch::Generator& x = ... in your code, what should I write now?
  • If any methods from at::Context were removed, document their replacements
  • If any methods on what previously was torch::Generator changed, document their replacements

One thing that makes this tricky is that this PR only does CPU, and not CUDA. So I'm not sure how many BC problems being only partially refactored introduces.

@syed-ahmed (Collaborator Author)

Additionally, the change in RNG has perturbed a few tests. I don't remember if this PR is supposed to produce exactly the same random numbers or not. We should probably fix these tests (esp. since you're planning to change the RNG engine), but if this was NOT expected for this diff, we should investigate more.

This PR is supposed to produce exactly the same random numbers. One reason for the failing tests could be how I'm reading from /dev/random. If there is no torch.manual_seed() in the failing tests, the seed is set by getNonDeterministicRandom():

uint64_t getNonDeterministicRandom() {
  std::random_device rd;
  uint32_t random1;
  uint32_t random2;
  if (rd.entropy() != 0) {
    random1 = rd();
    random2 = rd();
    return make64BitsFrom32Bits(random1, random2);
  }
  else {
    random1 = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    random2 = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    return make64BitsFrom32Bits(random1, random2);
  }
}

Previously, we did this with https://github.com/pytorch/pytorch/blob/master/aten/src/TH/THRandom.cpp#L62, which just static_casts a 32-bit number from time(0), whereas I concatenate two 32-bit numbers in my function. I could either change getNonDeterministicRandom() to do the same static_cast and then you can run the internal CI again, or, as a fix for the failing tests in the internal CI, set a manual seed at the beginning of each test.
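
(For reference, the concatenation is presumably something along these lines; the actual make64BitsFrom32Bits implementation isn't shown here:)

#include <cstdint>

// Presumed shape of make64BitsFrom32Bits: pack two 32-bit words into one
// 64-bit seed, high word first.
uint64_t make64BitsFrom32Bits(uint32_t hi, uint32_t lo) {
  return (static_cast<uint64_t>(hi) << 32) | lo;
}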

I am looking into writing the migration notes and will ping again about the tricky part ("I'm not sure how many BC problems being only partially refactored introduces").

@ezyang (Contributor) commented May 8, 2019

I could either just change getNonDeterministicRandom() to do the same static cast and then you can run the internal CI again, or a solution for the failing test in the internal CI could be to set a manual_seed at the beginning of the test.

I'm OK with running this test (if you paste a diff in here I can apply it), but the explanation you've given me doesn't make sense to me. If the internal tests were previously seeding based on the current time, I would expect them to be similarly flaky (under the assumption that "bad" seeds are uniformly distributed over the space of 64-bit integers; I suppose it's possible that they're actually not distributed this way, but that still seems pretty weird).

Thank you for the migration notes; they will also help me fix the internal use sites :) (Or they might also suggest to you that we can improve BC by reintroducing some old functions but just deprecating them.)

@ezyang (Contributor) commented May 8, 2019

I did a quick check to see if we were generating the same random numbers.

master

(/home/ezyang/Dev/pytorch-tmp-env) [ezyang@devgpu005.ash6 ~/Dev/pytorch-tmp] python                     
Python 3.7.2 (default, Dec 29 2018, 06:19:36)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.                                  
>>> import torch
>>> torch.manual_seed(0)
<torch._C.Generator object at 0x7f21d1694af0>
>>> torch.randn(10)
tensor([ 1.5410, -0.2934, -2.1788,  0.5684, -1.0845, -1.3986,  0.4033,  0.8380,                         
        -0.7193, -0.4033])

your PR

(/home/ezyang/Dev/pytorch-tmp-env) [ezyang@devgpu005.ash6 ~/Dev/pytorch-tmp] python
Python 3.7.2 (default, Dec 29 2018, 06:19:36) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.manual_seed(0)
<torch._C.Generator object at 0x7f5b08b74ab0>
>>> torch.randn(10)
tensor([ 1.5410, -0.2934, -2.1788,  0.5684, -1.0845, -1.3986,  0.4033,  0.8380,
        -0.7193, -0.4033])

So, at least normal generation looks like it's fine. (Could it be a problem in one of the other RNG generators? Hard to say...)
