
Conversation

@InnovArul
Contributor

This is the outcome of the PyTorch forum discussion: https://discuss.pytorch.org/t/dataloader-problem-problem-arises-when-shuffle-true/45631

The rng state is updated (unnecessarily?) by the test data loader, even when shuffle=False. This may hinder reproducibility.
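A minimal repro of the reported behaviour, as a sketch (the dataset and sizes are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10, dtype=torch.float32))

torch.manual_seed(0)
_ = next(iter(DataLoader(dataset, shuffle=False)))  # starting iteration draws a base_seed from the global RNG
with_loader = torch.rand(1)

torch.manual_seed(0)
without_loader = torch.rand(1)

print(torch.equal(with_loader, without_loader))  # False: the shuffle=False loader consumed one RNG draw
```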

@pytorchbot pytorchbot added the module: dataloader Related to torch.utils.data.DataLoader and Sampler label May 21, 2019
Collaborator

@ssnl ssnl left a comment

This is not correct, and it makes rng generation depend on num_workers. I don't know why you would assume the rng to be unchanged for different code. Finally, this seed has nothing to do with shuffle.

@ezyang
Contributor

ezyang commented May 21, 2019

Waiting for lint to pass

@ssnl
Collaborator

ssnl commented May 21, 2019

We specifically moved that line out of the if condition in a previous PR.

@ezyang
Contributor

ezyang commented May 21, 2019

This is not correct, and it makes rng generation depend on num_workers. I don't know why you would assume the rng to be unchanged for different code. Finally, this seed has nothing to do with shuffle.

I don't follow. The random vector appears to be unused when num_workers == 0.

@ezyang
Contributor

ezyang commented May 21, 2019

I guess the PR in question is #7265.

@ezyang
Contributor

ezyang commented May 21, 2019

cc @vfdev-5 @apaszke

@vfdev-5
Contributor

vfdev-5 commented May 21, 2019

The aim of #7265 was two-fold: 1) to make the batch sampler behaviour reproducible (when shuffle=True) for any number of workers, and 2) to make it possible to skip previous batches by simply calling next on the batch sampler multiple times (context: resuming a crashed training run and the DataLoader state from a given iteration).
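For illustration, a sketch of the second use case; the loader, counter, and training-step names are assumptions, not code from the PR:

```python
# resume a crashed run by fast-forwarding past the batches already processed
it = iter(train_loader)
for _ in range(num_batches_already_done):   # hypothetical counter saved at checkpoint time
    next(it)                                # reproduces and discards the earlier batches
for batch in it:                            # continue from the interrupted iteration
    train_step(batch)                       # hypothetical training step
```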

I agree with the remark that the rng state should not be changed while running a dataloader with shuffle=False. @InnovArul, however, the proposed fix will still change the rng if num_workers > 0 in the test dataloader, no?

@InnovArul
Contributor Author

InnovArul commented May 21, 2019

@vfdev-5 you are right! The fix would not work if num_workers > 0.
What do you think of this suggestion?

  • initialize base_seed in the constructor of Dataloader itself
  • access it from _DataLoaderIter (loader.base_seed)

I am not sure if it affects the old fixes.

@vfdev-5
Contributor

vfdev-5 commented May 21, 2019

@InnovArul maybe there is another way (without introducing a supplementary variable): check for this by testing isinstance(loader.sampler, SequentialSampler) (a rough sketch follows the list below).
I also thought it might be helpful later to introduce tests for all of these behaviors:
a) check the indices when shuffle=True in two cases: num_workers=0 and num_workers > 0 (example)
b) check that the rng state is unchanged after running the test loader.
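A rough sketch of the isinstance idea, assuming a `loader` object; the seed expression is illustrative and not the actual _DataLoaderIter code:

```python
import torch
from torch.utils.data.sampler import SequentialSampler

# only consume the global RNG when the sampler can actually produce random indices
if isinstance(loader.sampler, SequentialSampler):
    base_seed = 0                                                     # deterministic, no RNG draw
else:
    base_seed = torch.empty((), dtype=torch.int64).random_().item()   # one draw from the global RNG
```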

@ezyang @ssnl what do you think about all of this?

@InnovArul
Contributor Author

  • Yes, the requirement is not to update the rng state when shuffle=False, regardless of num_workers.
    Testing the sampler variable seems like a good fix as well.

  • Also, if we introduce a supplementary variable, the 2nd aim of Fix issue #7209 in DataLoader #7265 (the possibility to skip previous batches) would be jeopardized. So introducing a new variable may not be a good option.

@ssnl
Collaborator

ssnl commented May 21, 2019

Yes, the requirement is not to update the rng state when shuffle=False, regardless of num_workers.
Testing the sampler variable seems like a good fix as well.

I am really not sure about this. The seed generated here is completely unrelated to shuffle, and it is necessary for multiprocessing loading, even if you use a sequential sampler.

@InnovArul
Contributor Author

InnovArul commented May 21, 2019

I understand. shuffle has nothing to do with base_seed.

The use-case is as follows (from the pytorch forum thread):

  • Set the manual seed for torch, numpy, etc.
  • Usually, we initialize data loaders with shuffle=True for the train data loader and shuffle=False for the validation and test data loaders.
  • Then call methods in a loop: 1) train(train_dataloader), 2) validation(val_dataloader), 3) test(test_dataloader) (optional).
  • We expect train_dataloader to produce the same set of indices regardless of the validation()/test() function calls (i.e., whether only validation() is called or both validation() and test() are called).

With the current implementation, we do not get this behavior, because the rng state is modified by val_dataloader and test_dataloader by default.
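In code, the expectation is roughly the following sketch; the dataset, function, and counter names are illustrative, not from the original report:

```python
torch.manual_seed(42)                                      # fix the global seed once

train_loader = DataLoader(train_set, shuffle=True, batch_size=64)
val_loader = DataLoader(val_set, shuffle=False, batch_size=64)
test_loader = DataLoader(test_set, shuffle=False, batch_size=64)

for epoch in range(num_epochs):
    train(train_loader)       # expected: the same shuffled indices for a fixed seed,
    validation(val_loader)    # whether or not validation()/test() are called;
    test(test_loader)         # in practice they also advance the global RNG
```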

@vfdev-5
Contributor

vfdev-5 commented May 21, 2019

The seed generated here is completely unrelated to shuffle, and it is necessary for multiprocessing loading, even if you use a sequential sampler.

Oh, yes, that's true, I forgot about that:

def _worker_loop(dataset, index_queue, data_queue, done_event, collate_fn, seed, init_fn, worker_id):
    # See NOTE [ Data Loader Multiprocessing Shutdown Logic ] for details on the
    # logic of this function.
    try:
        collate._use_shared_memory = True
        # Initialize C side signal handlers for SIGBUS and SIGSEGV. Python signal
        # module's handlers are executed after Python returns from C low-level
        # handlers, likely when the same fatal signal had already happened
        # again.
        # https://docs.python.org/3/library/signal.html#execution-of-python-signal-handlers
        signal_handling._set_worker_signal_handlers()
        torch.set_num_threads(1)
        random.seed(seed)
        torch.manual_seed(seed)

@ssnl does it make sense to make:

random.seed(seed)
torch.manual_seed(seed)

optional in the case of shuffle=False?

we expect train_dataloader to produce the same set of indices regardless of whether we call the validation() and test() functions.

@InnovArul I would say that the requirement is only that train_loader produces the same indices when prefixed with torch.manual_seed(seed)...

@InnovArul
Contributor Author

@vfdev-5 that's right! I forgot to mention setting the manual seed.

@InnovArul
Contributor Author

Just checking in with you to update or close this request before we forget about it.
Do I need to change anything in the code, or do you think this change is unnecessary?

@ezyang
Contributor

ezyang commented May 31, 2019

OK, I finally read through all of the background information. At the end of the day, the root question is whether or not you expect Dataloader (without shuffle) to consume a random number. The expectation of a lot of people in this thread is that it should not. But we have to compute a base seed for the workers (to ensure that they are deterministic), and it should be randomized every time you call Dataloader, so the Dataloader rng call does seem merited. So I agree with @ssnl that we shouldn't move the base seed into the conditional; it is better to just unconditionally consume the seed, as that makes the contract about how many RNG numbers the function consumes clearer. So we will not accept this PR.

However, I think there are some useful follow-up PRs that we could accept.

  • We could add documentation to Dataloader making it clearer that it consumes RNG, even when shuffle is not set.
  • We could add a new seed keyword argument to the constructor, letting you explicitly specify the starting seed for the subjobs (and skip the RNG call).

So if you want to make a PR for any of these things, CC me and I'll review and land them.

For the particular user problem here, if they really liked the results from pipeline one, resetting the rng as suggested in https://discuss.pytorch.org/t/dataloader-problem-problem-arises-when-shuffle-true/45631/24 is a good workaround (if the seed keyword were added, this wouldn't be necessary; you would just pass a hardcoded seed for your workers). Alternatively, you could unconditionally construct the Dataloader object to ensure the RNG is consumed in the same way in both cases.
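A sketch of the rng-reset workaround referenced above; the loader and function names are illustrative:

```python
# save the global RNG state, run the loaders that should not influence training randomness,
# then restore the state so training proceeds as if they had never run
rng_state = torch.get_rng_state()          # CPU RNG; use torch.cuda.get_rng_state() for CUDA if needed
validation(val_loader)
test(test_loader)
torch.set_rng_state(rng_state)
```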

@InnovArul
Contributor Author

InnovArul commented May 31, 2019

@ezyang Thanks for the detailed reply!
I will work on the documentation and on adding the base_seed parameter to the constructor.

Before that, I would like to get a clarification:

we have to compute a base seed for the workers (to ensure that they are deterministic), and it should be randomized every time you call Dataloader

Do you mean that a separate seed has to be computed every time the Dataloader's __iter__ is called?
Or do you mean that the seed has to be computed only once per Dataloader?

I understand from your pointer to add a base_seed argument to the Dataloader's constructor that the seed is fixed per Dataloader instance. If so, can we move the base_seed determination (unless base_seed is passed by the user) to the Dataloader constructor (__init__) and clearly document that Dataloader construction consumes one RNG state (as mentioned here)? Does that make sense?

@ssnl
Collaborator

ssnl commented May 31, 2019

I understand from your pointer to add a base_seed argument to the Dataloader's constructor that the seed is fixed per Dataloader instance. If so, can we move the base_seed determination (unless base_seed is passed by the user) to the Dataloader constructor (__init__) and clearly document that Dataloader construction consumes one RNG state (as mentioned here)? Does that make sense?

This definitely cannot happen, as it would make every epoch generate the exact same random augmentations. However, I think the better way is to accept a torch.Generator object (defaulting to torch.default_generator).

But even then I have some concerns that the added argument would be confusing, because it only controls the randomness for that particular seed and does not affect Samplers. Our Samplers currently do not take in custom generators (neither CPU nor CUDA).
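For illustration, a sketch of the isolation a dedicated generator would provide. The generator-argument interface did not exist at the time of this discussion, although DataLoader did later gain a generator argument along these lines:

```python
import torch

# a loader-private RNG: draws taken from g leave torch.default_generator untouched
g = torch.Generator()
g.manual_seed(12345)

base_seed = torch.empty((), dtype=torch.int64).random_(generator=g).item()

# in later PyTorch versions this is spelled DataLoader(dataset, shuffle=True, generator=g)
```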

@ezyang
Contributor

ezyang commented May 31, 2019 via email

@InnovArul
Contributor Author

OK, thanks both. It seems RNG consumption is unavoidable, even when shuffle=False, due to the need for random transformations. I will try to add this to the documentation.

We already have a mention of base_seed in the Dataloader's documentation. I will add a line near it.

By default, each worker will have its PyTorch seed set to base_seed + worker_id, where base_seed is a long generated by main process using its RNG.
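For reference, a common pattern built on that contract: recover the per-worker seed via torch.initial_seed() inside worker_init_fn and reuse it to seed other libraries (a sketch; the numpy seeding and the function name are illustrative):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # inside a worker, torch.initial_seed() returns base_seed + worker_id
    worker_seed = torch.initial_seed() % 2**32   # numpy expects a 32-bit seed
    np.random.seed(worker_seed)

# loader = DataLoader(dataset, num_workers=4, worker_init_fn=seed_worker)
```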

Brennan Vincent and others added 6 commits June 20, 2019 12:36
Summary:
pytorch#21579 correctly points out an inaccuracy in the docs for `arange`.
Pull Request resolved: pytorch#21992

Differential Revision: D15914411

Pulled By: umanwizard

fbshipit-source-id: 3eb1734b29af3f3858f0f4d54c71e28dbda5c75b
Reviewed By: yns88

fbshipit-source-id: 0be0694d6adf1ae9baa408a4b372101a26a14ba4
Summary:
Pull Request resolved: pytorch#22025
ghimport-source-id: 0fb408a

Test Plan: Imported from OSS

Differential Revision: D15926651

Pulled By: ezyang

fbshipit-source-id: 298340bfbfe44dcd81cde8f0d56f8dbde92fb7bd
Summary:
Pull Request resolved: pytorch#20673

Add option to bucket-weighted pooling to hash the bucket so that any cardinality score can be used.

Reviewed By: huginhuangfb

Differential Revision: D15003509

fbshipit-source-id: 575a149de395f18fd7759f3edb485619f8aa5363
…feb6b4 (pytorch#22030)

Summary:
Pull Request resolved: pytorch#22030

Previous import was dd599b05f424eb161a31f3e059566a33310dbe5e

Included changes:
- **[355a4954](onnx/onnx@355a4954)**: Update codeowners to have community folder changes assigned to steering committee (pytorch#2104) <Prasanth Pulavarthi>
- **[ceaa5da7](onnx/onnx@ceaa5da7)**: Fix Resize/Upsample Shape inference function (pytorch#2085) <Raymond Yang>
- **[4de8dc0d](onnx/onnx@4de8dc0d)**: Clarify shape inference requirements for new operators (pytorch#2088) <Hariharan Seshadri>
- **[52aa1fad](onnx/onnx@52aa1fad)**: Fix NN defs file (pytorch#2083) <Hariharan Seshadri>

Reviewed By: bddppq

Differential Revision: D15924221

fbshipit-source-id: 91ba64ef3e1a2de4a7dd0b02ee6393508cc44a73
Summary:
This is a modified version of pytorch#14705 since commit structure for that PR is quite messy.

1. Add `IterableDataset`.
2. So we have 2 data loading modes: `Iterable` and `Map`.

    1. `Iterable` if the `dataset` is an instance of `IterableDataset`
    2. `Map` otherwise.

3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for doing things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
6. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
7. Import torch.utils.data in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.

Closes pytorch#17909, pytorch#18096, pytorch#19946, and some of pytorch#13023
Pull Request resolved: pytorch#19228

Reviewed By: bddppq

Differential Revision: D15058152

fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
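For illustration, a minimal sketch of the per-worker work splitting that `IterableDataset` and `get_worker_info` enable (the class and range are examples, not code from the commit):

```python
import math
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class RangeStream(IterableDataset):
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:                 # single-process loading: yield the whole range
            lo, hi = self.start, self.end
        else:                            # in a worker: yield only this worker's shard
            per_worker = math.ceil((self.end - self.start) / info.num_workers)
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

# each element of range(0, 10) appears exactly once across the two workers:
# list(DataLoader(RangeStream(0, 10), num_workers=2))
```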
pietern and others added 14 commits July 3, 2019 02:01
Summary:
If the current CUDA device is not the same as the device that hosts the tensor the operation works on, then OpenMPI will segfault, as reported in pytorch#21922. This change adds a device guard for every operation to ensure the correct device is set.

Fixes pytorch#21922.
Pull Request resolved: pytorch#22446

Differential Revision: D16106823

Pulled By: pietern

fbshipit-source-id: 99d762eb3851c0a0e0b4fe81cf27c1c8d35596cc
…ch#22348)

Summary:
Pull Request resolved: pytorch#22348

This is the last step of LRU hash eviction weight re-init. This diff checks whether there are evicted values in sparse_lookup; if so, it calls the op created in D15709866 to re-init the values for the indices in evicted_values. It also creates a gradient op for the operator. The gradient op just passes the output gradient through as the input gradient.

Reviewed By: itomatik

Differential Revision: D16044736

fbshipit-source-id: 9afb85209b0de1038c5153bcb7dfc5f52e0b2abb
Summary:
Pull Request resolved: pytorch#22499

Another place where onnx export is running dead code elimination after making the jit graph invalid. Fixing it.

Reviewed By: houseroad

Differential Revision: D16111969

fbshipit-source-id: 5ba80340c06d091988858077f142ea4e3da0638c
Summary:
Pull Request resolved: pytorch#21432

This diff introduces a new interface to generate tests based on the metadata of operators.

Reviewed By: ajauhri

Differential Revision: D15675542

fbshipit-source-id: ba60e803ea553d8b9eb6cb2bcdc6a0368ef62b1c
Summary:
- Fix typo in ```torch/onnx/utils.py``` when looking up registered custom ops.
- Add a simple test case
    1. Register custom op with ```TorchScript``` using ```cpp_extension.load_inline```.
    2. Register custom op with ```torch.onnx.symbolic``` using ```register_custom_op_symbolic```.
    3. Export model with custom op, and verify with Caffe2 backend.
Pull Request resolved: pytorch#21321

Differential Revision: D16101097

Pulled By: houseroad

fbshipit-source-id: 084f8b55e230e1cb6e9bd7bd52d7946cefda8e33
Summary:
Pull Request resolved: pytorch#22468

as title

Reviewed By: ajauhri

Differential Revision: D15744503

fbshipit-source-id: 050b32dd7f135512385fc04f098c376c664211a9
Summary:
## Fix pytorch#22192

+ change signature: `func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)`
+ change cuda source & cuda header
```cuda
// old signature
std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(const Tensor& self, const Tensor& mean, const Tensor& invstd, const Tensor& running_mean,
                                                        const Tensor& running_var, double momentum, double epsilon, int64_t count)
// new signature
std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(const Tensor& self, const Tensor& mean, const Tensor& invstd, const Tensor& running_mean,
                                                        const Tensor& running_var, double momentum, double epsilon, const Tensor& counts)
```
+ change python interface
```python
class SyncBatchNorm(Function):
    def forward(self, input, weight, bias, running_mean, running_var, eps, momentum, process_group, world_size):
        ...
```
Pull Request resolved: pytorch#22248

Differential Revision: D16002146

Pulled By: mrshenli

fbshipit-source-id: 9007e83928267b89df4d3847aabfbdb63e456956
Summary:
Pull Request resolved: pytorch#22309

This diff enables PT operators to run with JIT mode. Users can control eager and JIT mode using the `use_jit` flag.

In this diff, we put the operators in a loop and pass it to JIT. An extra step that wraps the operator with the `_consume` op is introduced to avoid JIT's dead code elimination optimization. With that, the reported time includes the real operator execution time plus the `_consume` op (which directly returns its input; nothing else happens inside).

Reviewed By: zheng-xq

Differential Revision: D16033082

fbshipit-source-id: e03be89fd5a505e44e81015dfc63db9cd76fb8a1
Summary:
* Deletes all weak script decorators / associated data structures / methods
   * In order to keep supporting the standard library in script, this enables recursive script on any function defined in `torch.nn`
   * Most changes in `torch/nn` are the result of `ag -Q "weak" torch/nn/ -l | xargs sed -i '/weak/d'`, only `rnn.py` needed manual editing to use the `ignore` and `export` to continue supporting the overloaded `forward` methods
* `Sequential`/`ModuleList` no longer need to be added to constants since they are compiled on demand

This should also fix pytorch#22212
Pull Request resolved: pytorch#22212

Differential Revision: D15988346

Pulled By: driazati

fbshipit-source-id: af223e3ad0580be895377312949997a70e988e4f
… D16079809 (pytorch#22456)

Summary:
re-apply changes reverted in:
pytorch#22412

Also change log_softmax to take positional arguments. Long-term we do want the kwarg-only interface, but it currently seems to be incompatible with jit serialization.
Pull Request resolved: pytorch#22456

Differential Revision: D16097159

Pulled By: nairbv

fbshipit-source-id: 8cb73e9ca18fc66b35b873cf4a574b167a578b3d
Summary: Pull Request resolved: pytorch#22275

Test Plan: Imported from OSS

Differential Revision: D16060597

Pulled By: wanchaol

fbshipit-source-id: a11d8ad3b037e15bd670cc7cd3fefd4f0abd0bba
Summary: Pull Request resolved: pytorch#22509

Differential Revision: D16119307

Pulled By: Krovatkin

fbshipit-source-id: 3251f594be6d365652b847bdc35dd4f4b62c35e6
@pytorchbot pytorchbot added module: ci Related to continuous integration module: docs Related to our documentation, both in docs/ and docblocks module: infra Relates to CI infrastructure module: pybind Related to our Python bindings / interactions with other Python libraries labels Jul 4, 2019
@InnovArul
Contributor Author

I rebased my repo onto the latest pytorch to update this pull request, but it has messed up the history.
I will create another pull request instead and am therefore closing this one. Sorry.

@InnovArul InnovArul closed this Jul 4, 2019
facebook-github-bot pushed a commit that referenced this pull request Jul 8, 2019
#22540)

Summary:
the outcome from the pytorch forum issue: https://discuss.pytorch.org/t/dataloader-problem-problem-arises-when-shuffle-true/45631

The discussion is here: #20749
Pull Request resolved: #22540

Differential Revision: D16131777

Pulled By: ezyang

fbshipit-source-id: 566deda1b44dc7fae54250e9b508d120851a2848