torch.utils.data.Dataloader: base seed is determined inside if loop #20749
Conversation
ssnl left a comment
This is not correct and makes rng generation dependent on `num_workers`. I don't know why you would assume the rng to be unchanged for different code. Finally, this seed has nothing to do with shuffle.
Waiting for lint to pass.
We specifically moved that line out of the `if` block.
I don't follow. The random vector appears to be unused when `num_workers=0`.
I guess the PR in question is #7265.
The aim of #7265 was, in part, to have a reproducible behaviour of the batch sampler. I agree with the remark that the rng state should not be changed while running the dataloader with `shuffle=False`.
@vfdev-5 you are right! The fix would not work in that case. I am not sure if it affects the old fixes.
@InnovArul maybe there is another way (without introducing a supplementary variable) to check this.
I am really not sure about this. The seed generated here is completely unrelated to `shuffle`.
I understand. The use-case is as follows (from the pytorch forum thread): training results should stay the same whether or not a test DataLoader with `shuffle=False` is also constructed and iterated. With the current implementation, we are not achieving this behavior, as the base seed is drawn from the global RNG even when `shuffle=False`.
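For illustration, a minimal sketch of the kind of comparison being made; the datasets and loop structure are placeholders, and the exact behaviour can vary with the PyTorch version:

```python
# Hypothetical sketch of the two pipelines being compared; the datasets
# here are placeholders, not code from the forum thread or this PR.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_ds = TensorDataset(torch.arange(10).float())
test_ds = TensorDataset(torch.arange(10).float())

# Pipeline 1: build and use only the training loader.
torch.manual_seed(0)
train_loader = DataLoader(train_ds, shuffle=True)
first_batch_a = next(iter(train_loader))[0]

# Pipeline 2: additionally build and iterate a test loader with shuffle=False.
torch.manual_seed(0)
test_loader = DataLoader(test_ds, shuffle=False)
for _ in test_loader:          # creating the iterator draws a base seed
    pass
train_loader = DataLoader(train_ds, shuffle=True)
first_batch_b = next(iter(train_loader))[0]

# The expectation in the thread is that these match; because the global RNG
# was advanced by the shuffle=False loader, they generally do not.
print(torch.equal(first_batch_a, first_batch_b))
```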
Oh, yes, that's true, I forgot about that: `pytorch/torch/utils/data/_utils/worker.py`, lines 58 to 75 (at 47dc65f).
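For context, a paraphrased sketch of what that worker-seeding section does (a simplification, not the exact source):

```python
# Simplified, paraphrased sketch of the worker seeding referenced above
# (not the exact contents of torch/utils/data/_utils/worker.py).
import random
import torch

def _seed_worker(base_seed, worker_id, worker_init_fn=None):
    # Each worker derives its seed from the base seed drawn in the main
    # process, so workers are reproducible but not identical to each other.
    seed = base_seed + worker_id
    random.seed(seed)
    torch.manual_seed(seed)
    if worker_init_fn is not None:
        worker_init_fn(worker_id)
```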
@ssnl does it make sense to make `random.seed(seed)` and `torch.manual_seed(seed)` optional in that case?
@InnovArul I would say that the requirement is only that
@vfdev-5 that's right! I forgot to mention the setting of the manual seed.
Just checking with you all whether to update or close the request before we forget this.
OK, I finally read through all of the background information. At the end of the day, the root question is whether or not you expect Dataloader (without shuffle) to consume a random number. It seems that the expectation of a lot of people in this thread is that it should not. But we have to compute a base seed for the workers (to ensure that they are deterministic), and it should be randomized every time you call Dataloader, so it does seem like the Dataloader rng call is merited. So I agree with @ssnl that we shouldn't move the base seed into the conditional; better for us to just unconditionally consume the seed, as it makes the contract regarding how many RNG numbers the function consumes more clear. So we will not accept this PR. However, I think there are some useful follow-up PRs that we could accept.
So if you want to make a PR for any of these things, CC me and I'll review and land them. For the particular user problem here, if he really liked his results from pipeline one, I'd say resetting the rng as suggested in https://discuss.pytorch.org/t/dataloader-problem-problem-arises-when-shuffle-true/45631/24 is a good workaround (if the seed keyword was added, this wouldn't be necessary; you'd just pass a hardcoded seed for your workers). Alternatively, you could unconditionally construct the Dataloader object to ensure the RNG is consumed in the same way in both cases.
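For reference, one way the "reset the rng" workaround could look; this is a sketch of the general idea, not the exact code from the forum post:

```python
# Sketch of the "reset the rng" workaround: save the global CPU RNG state,
# build and iterate the extra loader, then restore the state so that the
# training pipeline sees the same RNG sequence either way.
import torch
from torch.utils.data import DataLoader, TensorDataset

test_ds = TensorDataset(torch.arange(10).float())   # placeholder dataset

rng_state = torch.get_rng_state()
test_loader = DataLoader(test_ds, shuffle=False)
for batch in test_loader:
    pass                                            # evaluation would go here
torch.set_rng_state(rng_state)
```

If Python's `random` module is also involved, its state can be saved and restored in the same way with `random.getstate()` and `random.setstate()`.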
@ezyang Thanks for the detailed reply! Before that, I would like to get a clarification: do you mean a separate seed has to be computed every time the Dataloader is iterated? I understand from your pointer to add a `base_seed` argument to the Dataloader's constructor that the seed is fixed per Dataloader instance. If so, can we move the `base_seed` determination (unless `base_seed` is passed by the user) to the Dataloader constructor (`__init__`) and clearly document that Dataloader construction consumes one RNG state (as mentioned here)? Does it make sense?
This definitely can not happen, as it would make every epoch generate the exact same random augmentations. However, I think the better way is to accept a `torch.Generator` object (that is defaulted as `torch.default_generator`). But even then I have some concerns that the added argument would be confusing, because it only controls the randomness for that particular seed, and does not affect `Sampler`s. Our `Sampler`s currently do not take in custom generators (neither CPU nor CUDA).
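A rough sketch of how the `torch.Generator` idea could work for the base seed; the `generator` parameter below is hypothetical, not the DataLoader API at the time of this discussion:

```python
# Hypothetical sketch: draw the base seed from a caller-supplied
# torch.Generator instead of the global RNG. The `generator` argument is an
# assumption for illustration, not the actual DataLoader signature here.
import torch

def draw_base_seed(generator=None):
    gen = generator if generator is not None else torch.default_generator
    # random_() with an explicit generator leaves the global RNG untouched
    # whenever a private generator is passed in.
    return torch.empty((), dtype=torch.int64).random_(generator=gen).item()

private_gen = torch.Generator().manual_seed(1234)
base_seed = draw_base_seed(private_gen)   # does not advance the global RNG
```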
Yes, SsnL's concern here seems legitimate and I am not sure there is a way to solve that problem. So we might have to satisfy ourselves with docs.
Excerpts from Tongzhou Wang's message of 2019-05-31 11:41:55 -0700:
> I understand from your pointer to add base_seed argument to the Dataloader's constructor that the seed is fixed per Dataloader instance. If so, can we move the base_seed determination (unless base_seed is passed by the user) to the Dataloader constructor (__init__) and clearly document that Dataloader construction consumes one RNG state? (as mentioned here). Does it make sense?
This definitely can not happen, as it would make every epoch generate the exact same random augmentations. However, I think the better way is to accept a `torch.Generator` object (that is defaulted as `torch.default_generator`).
But even then I have some concerns that the added argument would be confusing, because it only controls the randomness for that particular seed, and does not affect `Sampler`s. Our `Sampler`s currently do not take in custom generators (neither CPU nor CUDA).
OK, thanks both. It seems RNG consumption is unavoidable even with `shuffle=False`. We already have a mention about seeding in the docs.
Summary: pytorch#21579 correctly points out an inaccuracy in the docs for `arange`. Pull Request resolved: pytorch#21992 Differential Revision: D15914411 Pulled By: umanwizard fbshipit-source-id: 3eb1734b29af3f3858f0f4d54c71e28dbda5c75b
Reviewed By: yns88 fbshipit-source-id: 0be0694d6adf1ae9baa408a4b372101a26a14ba4
Summary: Pull Request resolved: pytorch#22025 ghimport-source-id: 0fb408a Test Plan: Imported from OSS Differential Revision: D15926651 Pulled By: ezyang fbshipit-source-id: 298340bfbfe44dcd81cde8f0d56f8dbde92fb7bd
Summary: Pull Request resolved: pytorch#20673 Add option to bucket-weighted pooling to hash the bucket so that any cardinality score can be used. Reviewed By: huginhuangfb Differential Revision: D15003509 fbshipit-source-id: 575a149de395f18fd7759f3edb485619f8aa5363
…feb6b4 (pytorch#22030) Summary: Pull Request resolved: pytorch#22030 Previous import was dd599b05f424eb161a31f3e059566a33310dbe5e Included changes: - **[355a4954](onnx/onnx@355a4954)**: Update codeowners to have community folder changes assigned to steering committee (pytorch#2104) <Prasanth Pulavarthi> - **[ceaa5da7](onnx/onnx@ceaa5da7)**: Fix Resize/Upsample Shape inference function (pytorch#2085) <Raymond Yang> - **[4de8dc0d](onnx/onnx@4de8dc0d)**: Clarify shape inference requirements for new operators (pytorch#2088) <Hariharan Seshadri> - **[52aa1fad](onnx/onnx@52aa1fad)**: Fix NN defs file (pytorch#2083) <Hariharan Seshadri> Reviewed By: bddppq Differential Revision: D15924221 fbshipit-source-id: 91ba64ef3e1a2de4a7dd0b02ee6393508cc44a73
Summary: This is a modified version of pytorch#14705 since commit structure for that PR is quite messy. 1. Add `IterableDataset`. 3. So we have 2 data loader mods: `Iterable` and `Map`. 1. `Iterable` if the `dataset` is an instance of `IterableDataset` 2. `Map` o.w. 3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful in doing things like bulk loading. 3. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`. 4. Add `torch.utils.data.get_worker_info` which returns worker information in a worker proc (e.g., worker id, dataset obj copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration. 5. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`. 7. Import torch.utils.data in `torch/__init__.py` 9. data loader examples and documentations 10. Use `get_worker_info` to detect whether we are in a worker process in `default_collate` Closes pytorch#17909, pytorch#18096, pytorch#19946, and some of pytorch#13023 Pull Request resolved: pytorch#19228 Reviewed By: bddppq Differential Revision: D15058152 fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
Summary: If the current CUDA device is not the same as the device that hosts the tensor the operation works on then OpenMPI will segfault, as reported in pytorch#21922. This changes adds a device guard for every operation to ensure the correct device is set. Fixes pytorch#21922. Pull Request resolved: pytorch#22446 Differential Revision: D16106823 Pulled By: pietern fbshipit-source-id: 99d762eb3851c0a0e0b4fe81cf27c1c8d35596cc
…ch#22348) Summary: Pull Request resolved: pytorch#22348 This is the last step of LRU hash eviction weight re-init. This diff checks if there's evicted values in sparse_lookup, if so call op created in D15709866 to re-init the values for indicies in evicted_values. Also created gradient op for the operator. The gradient op just passes the output gradient as input gradient. Reviewed By: itomatik Differential Revision: D16044736 fbshipit-source-id: 9afb85209b0de1038c5153bcb7dfc5f52e0b2abb
Summary: Pull Request resolved: pytorch#22499 Another place where onnx export is running dead code elimination after making the jit graph invalid. Fixing it. Reviewed By: houseroad Differential Revision: D16111969 fbshipit-source-id: 5ba80340c06d091988858077f142ea4e3da0638c
Summary: Pull Request resolved: pytorch#21432 This diff introduce a new interface to generate tests based on the metadata of operators. Reviewed By: ajauhri Differential Revision: D15675542 fbshipit-source-id: ba60e803ea553d8b9eb6cb2bcdc6a0368ef62b1c
Summary:
- Fix typo in ```torch/onnx/utils.py``` when looking up registered custom ops.
- Add a simple test case
1. Register custom op with ```TorchScript``` using ```cpp_extension.load_inline```.
2. Register custom op with ```torch.onnx.symbolic``` using ```register_custom_op_symbolic```.
3. Export model with custom op, and verify with Caffe2 backend.
Pull Request resolved: pytorch#21321
Differential Revision: D16101097
Pulled By: houseroad
fbshipit-source-id: 084f8b55e230e1cb6e9bd7bd52d7946cefda8e33
Summary: Pull Request resolved: pytorch#22468 as title Reviewed By: ajauhri Differential Revision: D15744503 fbshipit-source-id: 050b32dd7f135512385fc04f098c376c664211a9
Summary: ## Fix pytorch#22192 + change signature: `func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)` + change cuda & cuda head ```cuda std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(const Tensor& self, const Tensor& mean, const Tensor& invstd, const Tensor& running_mean, const Tensor& running_var, double momentum, double epsilon, int64_t count) { const Tensor& running_var, double momentum, double epsilon, const Tensor& counts) ``` + change python interface ```python class SyncBatchNorm(Function): def forward(self, input, weight, bias, running_mean, running_var, eps, momentum, process_group, world_size): ... ``` Pull Request resolved: pytorch#22248 Differential Revision: D16002146 Pulled By: mrshenli fbshipit-source-id: 9007e83928267b89df4d3847aabfbdb63e456956
Summary: Pull Request resolved: pytorch#22309 This diff enables PT operators to run with JIT mode. Users can control eager and JIT mode using the `use_jit` flag. In this diff, we are putting operators in a loop and passed it to JIT. One extra step which wraps the operator with the `_consume` op is introduced to avoid dead code elimination optimization in JIT. With that, the reported time includes the real operator execution time plus the `_consume` (directly return input, nothing else if happening inside) op. Reviewed By: zheng-xq Differential Revision: D16033082 fbshipit-source-id: e03be89fd5a505e44e81015dfc63db9cd76fb8a1
Summary: * Deletes all weak script decorators / associated data structures / methods * In order to keep supporting the standard library in script, this enables recursive script on any function defined in `torch.nn` * Most changes in `torch/nn` are the result of `ag -Q "weak" torch/nn/ -l | xargs sed -i '/weak/d'`, only `rnn.py` needed manual editing to use the `ignore` and `export` to continue supporting the overloaded `forward` methods * `Sequential`/`ModuleList` no longer need to be added to constants since they are compiled on demand This should also fix pytorch#22212 Pull Request resolved: pytorch#22212 Differential Revision: D15988346 Pulled By: driazati fbshipit-source-id: af223e3ad0580be895377312949997a70e988e4f
… D16079809 (pytorch#22456) Summary: re-apply changes reverted in: pytorch#22412 Also change log_softmax to take positional arguments. Long-term we do want the kwarg-only interface, but seems to currently be incompatible with jit serialization. Pull Request resolved: pytorch#22456 Differential Revision: D16097159 Pulled By: nairbv fbshipit-source-id: 8cb73e9ca18fc66b35b873cf4a574b167a578b3d
Summary: Pull Request resolved: pytorch#22275 Test Plan: Imported from OSS Differential Revision: D16060597 Pulled By: wanchaol fbshipit-source-id: a11d8ad3b037e15bd670cc7cd3fefd4f0abd0bba
Summary: Pull Request resolved: pytorch#22509 Differential Revision: D16119307 Pulled By: Krovatkin fbshipit-source-id: 3251f594be6d365652b847bdc35dd4f4b62c35e6
I rebased my repo onto the latest pytorch to update this pull request, but it has messed up the history.
Summary: the outcome from the pytorch forum issue: https://discuss.pytorch.org/t/dataloader-problem-problem-arises-when-shuffle-true/45631
The discussion is here: #20749
Pull Request resolved: #22540
Differential Revision: D16131777
Pulled By: ezyang
fbshipit-source-id: 566deda1b44dc7fae54250e9b508d120851a2848
The outcome from the pytorch forum issue: https://discuss.pytorch.org/t/dataloader-problem-problem-arises-when-shuffle-true/45631
The rng state is updated (unnecessarily?) by the test data loader, even if `shuffle=False` is set. This may hinder reproducibility.
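A minimal sketch of the behaviour described above (the dataset is a placeholder, and the exact behaviour may depend on the PyTorch version):

```python
# Minimal check: creating a DataLoader iterator advances the global CPU RNG
# even with shuffle=False, because a base seed is drawn at that point.
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(4).float())

torch.manual_seed(0)
before = torch.get_rng_state()
_ = iter(DataLoader(ds, shuffle=False))
after = torch.get_rng_state()

print(torch.equal(before, after))   # expected: False, the state has changed
```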