
Conversation

@InnovArul
Contributor

This is the outcome of the PyTorch forum discussion: https://discuss.pytorch.org/t/dataloader-problem-problem-arises-when-shuffle-true/45631

The rng state is updated (unnecessarily?) by the test data loader, even when shuffle=False. This may hinder reproducibility.
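A minimal repro of the reported behaviour, as a sketch (the dataset and sizes are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10, dtype=torch.float32))

torch.manual_seed(0)
_ = next(iter(DataLoader(dataset, shuffle=False)))  # starting iteration draws a base_seed from the global RNG
with_loader = torch.rand(1)

torch.manual_seed(0)
without_loader = torch.rand(1)

print(torch.equal(with_loader, without_loader))  # False: the shuffle=False loader consumed one RNG draw
```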

@pytorchbot pytorchbot added the module: dataloader Related to torch.utils.data.DataLoader and Sampler label May 21, 2019
Collaborator

@ssnl ssnl left a comment

This is not correct, and it makes rng generation depend on num_workers. I don't know why you would assume the rng to be unchanged for different code. Finally, this seed has nothing to do with shuffle.

@ezyang
Contributor

ezyang commented May 21, 2019

Waiting for lint to pass

@ssnl
Collaborator

ssnl commented May 21, 2019

We specifically moved that line out of the if condition in a previous PR.

@ezyang
Contributor

ezyang commented May 21, 2019

This is not correct, and it makes rng generation depend on num_workers. I don't know why you would assume the rng to be unchanged for different code. Finally, this seed has nothing to do with shuffle.

I don't follow. The random vector appears to be unused when num_workers == 0.

@ezyang
Contributor

ezyang commented May 21, 2019

I guess the PR in question is #7265.

@ezyang
Contributor

ezyang commented May 21, 2019

cc @vfdev-5 @apaszke

@vfdev-5
Contributor

vfdev-5 commented May 21, 2019

The aim of #7265 was two-fold: 1) to make the batch sampler behaviour reproducible (when shuffle=True) for any number of workers, and 2) to make it possible to skip previous batches by simply calling next on the batch sampler multiple times (context: resuming a crashed training run and the DataLoader state from a given iteration).
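For illustration, a sketch of the second use case; the loader, counter, and training-step names are assumptions, not code from the PR:

```python
# resume a crashed run by fast-forwarding past the batches already processed
it = iter(train_loader)
for _ in range(num_batches_already_done):   # hypothetical counter saved at checkpoint time
    next(it)                                # reproduces and discards the earlier batches
for batch in it:                            # continue from the interrupted iteration
    train_step(batch)                       # hypothetical training step
```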

I agree with the remark that the rng state should not be changed while running a dataloader with shuffle=False. @InnovArul, however, the proposed fix will still change the rng if num_workers > 0 in the test dataloader, no?

@InnovArul
Contributor Author

InnovArul commented May 21, 2019

@vfdev-5 you are right! The fix would not work if num_workers > 0.
What do you think of this suggestion?

  • initialize base_seed in the constructor of Dataloader itself
  • access it from _DataLoaderIter (loader.base_seed)

I am not sure if it affects the old fixes.

@vfdev-5
Contributor

vfdev-5 commented May 21, 2019

@InnovArul maybe there is another way (without introducing a supplementary variable): check for this by testing isinstance(loader.sampler, SequentialSampler) (a rough sketch follows the list below).
I also thought it might be helpful later to introduce tests for all of these behaviors:
a) check the indices when shuffle=True in two cases: num_workers=0 and num_workers > 0 (example)
b) check that the rng state is unchanged after running the test loader.
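A rough sketch of the isinstance idea, assuming a `loader` object; the seed expression is illustrative and not the actual _DataLoaderIter code:

```python
import torch
from torch.utils.data.sampler import SequentialSampler

# only consume the global RNG when the sampler can actually produce random indices
if isinstance(loader.sampler, SequentialSampler):
    base_seed = 0                                                     # deterministic, no RNG draw
else:
    base_seed = torch.empty((), dtype=torch.int64).random_().item()   # one draw from the global RNG
```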

@ezyang @ssnl what do you think about all of this?

@InnovArul
Contributor Author

  • Yes, the requirement is not to update the rng state when shuffle=False, regardless of num_workers.
    Testing the sampler variable seems like a good fix as well.

  • Also, if we introduce a supplementary variable, the 2nd aim of Fix issue #7209 in DataLoader #7265 (the possibility to skip previous batches) would be jeopardized. So introducing a new variable may not be a good option.

@ssnl
Collaborator

ssnl commented May 21, 2019

Yes, the requirement is not to update the rng state when shuffle=False, regardless of num_workers.
Testing the sampler variable seems like a good fix as well.

I am really not sure about this. The seed generated here is completely unrelated to shuffle, and it is necessary for multiprocessing loading, even if you use a sequential sampler.

@InnovArul
Contributor Author

InnovArul commented May 21, 2019

I understand. shuffle has nothing to do with base_seed.

The use-case is as follows (from the pytorch forum thread):

  • Set the manual seed for torch, numpy, etc.
  • Usually, we initialize data loaders with shuffle=True for the train data loader and shuffle=False for the validation and test data loaders.
  • Then call methods in a loop: 1) train(train_dataloader), 2) validation(val_dataloader), 3) test(test_dataloader) (optional).
  • We expect train_dataloader to produce the same set of indices regardless of the validation()/test() function calls (i.e., whether only validation() is called or both validation() and test() are called).

With the current implementation, we do not get this behavior, because the rng state is modified by val_dataloader and test_dataloader by default.
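In code, the expectation is roughly the following sketch; the dataset, function, and counter names are illustrative, not from the original report:

```python
torch.manual_seed(42)                                      # fix the global seed once

train_loader = DataLoader(train_set, shuffle=True, batch_size=64)
val_loader = DataLoader(val_set, shuffle=False, batch_size=64)
test_loader = DataLoader(test_set, shuffle=False, batch_size=64)

for epoch in range(num_epochs):
    train(train_loader)       # expected: the same shuffled indices for a fixed seed,
    validation(val_loader)    # whether or not validation()/test() are called;
    test(test_loader)         # in practice they also advance the global RNG
```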

@vfdev-5
Contributor

vfdev-5 commented May 21, 2019

The seed generated here is completely unrelated to shuffle, and it is necessary for multiprocessing loading, even if you use a sequential sampler.

Oh, yes, that's true, I forgot about that:

def _worker_loop(dataset, index_queue, data_queue, done_event, collate_fn, seed, init_fn, worker_id):
    # See NOTE [ Data Loader Multiprocessing Shutdown Logic ] for details on the
    # logic of this function.
    try:
        collate._use_shared_memory = True
        # Initialize C side signal handlers for SIGBUS and SIGSEGV. Python signal
        # module's handlers are executed after Python returns from C low-level
        # handlers, likely when the same fatal signal had already happened
        # again.
        # https://docs.python.org/3/library/signal.html#execution-of-python-signal-handlers
        signal_handling._set_worker_signal_handlers()
        torch.set_num_threads(1)
        random.seed(seed)
        torch.manual_seed(seed)

@ssnl does it make sense to make:

random.seed(seed)
torch.manual_seed(seed)

optional in the case of shuffle=False?

we expect train_dataloader to produce the same set of indices regardless of whether we call the validation() and test() functions.

@InnovArul I would say that the requirement is only that train_loader produces the same indices when prefixed with torch.manual_seed(seed)...

@InnovArul
Contributor Author

@vfdev-5 that's right! I forgot to mention setting the manual seed.

@InnovArul
Contributor Author

Just checking in with you to update or close this request before we forget about it.
Do I need to change anything in the code, or do you think this change is unnecessary?

@ezyang
Contributor

ezyang commented May 31, 2019

OK, I finally read through all of the background information. At the end of the day, the root question is whether or not you expect Dataloader (without shuffle) to consume a random number. The expectation of a lot of people in this thread is that it should not. But we have to compute a base seed for the workers (to ensure that they are deterministic), and it should be randomized every time you call Dataloader, so the Dataloader rng call does seem merited. So I agree with @ssnl that we shouldn't move the base seed into the conditional; it is better to just unconditionally consume the seed, as that makes the contract about how many RNG numbers the function consumes clearer. So we will not accept this PR.

However, I think there are some useful follow-up PRs that we could accept.

  • We could add documentation to Dataloader making it clearer that it consumes RNG, even when shuffle is not set.
  • We could add a new seed keyword argument to the constructor, letting you explicitly specify the starting seed for the subjobs (and skip the RNG call).

So if you want to make a PR for any of these things, CC me and I'll review and land them.

For the particular user problem here, if they really liked the results from pipeline one, resetting the rng as suggested in https://discuss.pytorch.org/t/dataloader-problem-problem-arises-when-shuffle-true/45631/24 is a good workaround (if the seed keyword were added, this wouldn't be necessary; you would just pass a hardcoded seed for your workers). Alternatively, you could unconditionally construct the Dataloader object to ensure the RNG is consumed in the same way in both cases.
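A sketch of the rng-reset workaround referenced above; the loader and function names are illustrative:

```python
# save the global RNG state, run the loaders that should not influence training randomness,
# then restore the state so training proceeds as if they had never run
rng_state = torch.get_rng_state()          # CPU RNG; use torch.cuda.get_rng_state() for CUDA if needed
validation(val_loader)
test(test_loader)
torch.set_rng_state(rng_state)
```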

@InnovArul
Contributor Author

InnovArul commented May 31, 2019

@ezyang Thanks for the detailed reply!
I will work on the documentation and on adding the base_seed parameter to the constructor.

Before that, I would like to get a clarification:

we have to compute a base seed for the workers (to ensure that they are deterministic), and it should be randomized every time you call Dataloader

Do you mean that a separate seed has to be computed every time the Dataloader's __iter__ is called?
Or do you mean that the seed has to be computed only once per Dataloader?

I understand from your pointer to add a base_seed argument to the Dataloader's constructor that the seed is fixed per Dataloader instance. If so, can we move the base_seed determination (unless base_seed is passed by the user) to the Dataloader constructor (__init__) and clearly document that Dataloader construction consumes one RNG state (as mentioned here)? Does that make sense?

@ssnl
Collaborator

ssnl commented May 31, 2019

I understand from your pointer to add a base_seed argument to the Dataloader's constructor that the seed is fixed per Dataloader instance. If so, can we move the base_seed determination (unless base_seed is passed by the user) to the Dataloader constructor (__init__) and clearly document that Dataloader construction consumes one RNG state (as mentioned here)? Does that make sense?

This definitely cannot happen, as it would make every epoch generate the exact same random augmentations. However, I think the better way is to accept a torch.Generator object (defaulting to torch.default_generator).

But even then I have some concerns that the added argument would be confusing, because it only controls the randomness for that particular seed and does not affect Samplers. Our Samplers currently do not take in custom generators (neither CPU nor CUDA).
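For illustration, a sketch of the isolation a dedicated generator would provide. The generator-argument interface did not exist at the time of this discussion, although DataLoader did later gain a generator argument along these lines:

```python
import torch

# a loader-private RNG: draws taken from g leave torch.default_generator untouched
g = torch.Generator()
g.manual_seed(12345)

base_seed = torch.empty((), dtype=torch.int64).random_(generator=g).item()

# in later PyTorch versions this is spelled DataLoader(dataset, shuffle=True, generator=g)
```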

@ezyang
Contributor

ezyang commented May 31, 2019 via email

@InnovArul
Contributor Author

OK, thanks both. It seems RNG consumption is unavoidable, even when shuffle=False, due to the need for random transformations. I will try to add this to the documentation.

We already have a mention of base_seed in the Dataloader's documentation. I will add a line near it.

By default, each worker will have its PyTorch seed set to base_seed + worker_id, where base_seed is a long generated by main process using its RNG.
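For reference, a common pattern built on that contract: recover the per-worker seed via torch.initial_seed() inside worker_init_fn and reuse it to seed other libraries (a sketch; the numpy seeding and the function name are illustrative):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # inside a worker, torch.initial_seed() returns base_seed + worker_id
    worker_seed = torch.initial_seed() % 2**32   # numpy expects a 32-bit seed
    np.random.seed(worker_seed)

# loader = DataLoader(dataset, num_workers=4, worker_init_fn=seed_worker)
```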

Brennan Vincent and others added 6 commits June 20, 2019 12:36
Summary:
pytorch#21579 correctly points out an inaccuracy in the docs for `arange`.
Pull Request resolved: pytorch#21992

Differential Revision: D15914411

Pulled By: umanwizard

fbshipit-source-id: 3eb1734b29af3f3858f0f4d54c71e28dbda5c75b
Reviewed By: yns88

fbshipit-source-id: 0be0694d6adf1ae9baa408a4b372101a26a14ba4
Summary:
Pull Request resolved: pytorch#22025
ghimport-source-id: 0fb408a

Test Plan: Imported from OSS

Differential Revision: D15926651

Pulled By: ezyang

fbshipit-source-id: 298340bfbfe44dcd81cde8f0d56f8dbde92fb7bd
Summary:
Pull Request resolved: pytorch#20673

Add option to bucket-weighted pooling to hash the bucket so that any cardinality score can be used.

Reviewed By: huginhuangfb

Differential Revision: D15003509

fbshipit-source-id: 575a149de395f18fd7759f3edb485619f8aa5363
…feb6b4 (pytorch#22030)

Summary:
Pull Request resolved: pytorch#22030

Previous import was dd599b05f424eb161a31f3e059566a33310dbe5e

Included changes:
- **[355a4954](onnx/onnx@355a4954)**: Update codeowners to have community folder changes assigned to steering committee (pytorch#2104) <Prasanth Pulavarthi>
- **[ceaa5da7](onnx/onnx@ceaa5da7)**: Fix Resize/Upsample Shape inference function (pytorch#2085) <Raymond Yang>
- **[4de8dc0d](onnx/onnx@4de8dc0d)**: Clarify shape inference requirements for new operators (pytorch#2088) <Hariharan Seshadri>
- **[52aa1fad](onnx/onnx@52aa1fad)**: Fix NN defs file (pytorch#2083) <Hariharan Seshadri>

Reviewed By: bddppq

Differential Revision: D15924221

fbshipit-source-id: 91ba64ef3e1a2de4a7dd0b02ee6393508cc44a73
Summary:
This is a modified version of pytorch#14705 since commit structure for that PR is quite messy.

1. Add `IterableDataset`.
2. So we have 2 data loading modes: `Iterable` and `Map`.

    1. `Iterable` if the `dataset` is an instance of `IterableDataset`
    2. `Map` otherwise.

3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for doing things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
6. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
7. Import torch.utils.data in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.

Closes pytorch#17909, pytorch#18096, pytorch#19946, and some of pytorch#13023
Pull Request resolved: pytorch#19228

Reviewed By: bddppq

Differential Revision: D15058152

fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
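For illustration, a minimal sketch of the per-worker work splitting that `IterableDataset` and `get_worker_info` enable (the class and range are examples, not code from the commit):

```python
import math
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class RangeStream(IterableDataset):
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:                 # single-process loading: yield the whole range
            lo, hi = self.start, self.end
        else:                            # in a worker: yield only this worker's shard
            per_worker = math.ceil((self.end - self.start) / info.num_workers)
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

# each element of range(0, 10) appears exactly once across the two workers:
# list(DataLoader(RangeStream(0, 10), num_workers=2))
```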
pietern and others added 14 commits July 3, 2019 02:01
Summary:
If the current CUDA device is not the same as the device that hosts the tensor the operation works on, then OpenMPI will segfault, as reported in pytorch#21922. This change adds a device guard for every operation to ensure the correct device is set.

Fixes pytorch#21922.
Pull Request resolved: pytorch#22446

Differential Revision: D16106823

Pulled By: pietern

fbshipit-source-id: 99d762eb3851c0a0e0b4fe81cf27c1c8d35596cc
…ch#22348)

Summary:
Pull Request resolved: pytorch#22348

This is the last step of LRU hash eviction weight re-init. This diff checks whether there are evicted values in sparse_lookup; if so, it calls the op created in D15709866 to re-init the values for the indices in evicted_values. It also creates a gradient op for the operator. The gradient op just passes the output gradient through as the input gradient.

Reviewed By: itomatik

Differential Revision: D16044736

fbshipit-source-id: 9afb85209b0de1038c5153bcb7dfc5f52e0b2abb
Summary:
Pull Request resolved: pytorch#22499

Another place where onnx export is running dead code elimination after making the jit graph invalid. Fixing it.

Reviewed By: houseroad

Differential Revision: D16111969

fbshipit-source-id: 5ba80340c06d091988858077f142ea4e3da0638c
Summary:
Pull Request resolved: pytorch#21432

This diff introduces a new interface to generate tests based on the metadata of operators.

Reviewed By: ajauhri

Differential Revision: D15675542

fbshipit-source-id: ba60e803ea553d8b9eb6cb2bcdc6a0368ef62b1c
Summary:
- Fix typo in ```torch/onnx/utils.py``` when looking up registered custom ops.
- Add a simple test case
    1. Register custom op with ```TorchScript``` using ```cpp_extension.load_inline```.
    2. Register custom op with ```torch.onnx.symbolic``` using ```register_custom_op_symbolic```.
    3. Export model with custom op, and verify with Caffe2 backend.
Pull Request resolved: pytorch#21321

Differential Revision: D16101097

Pulled By: houseroad

fbshipit-source-id: 084f8b55e230e1cb6e9bd7bd52d7946cefda8e33
Summary:
Pull Request resolved: pytorch#22468

as title

Reviewed By: ajauhri

Differential Revision: D15744503

fbshipit-source-id: 050b32dd7f135512385fc04f098c376c664211a9
Summary:
## Fix pytorch#22192

+ change signature: `func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)`
+ change cuda source & cuda header
```cuda
// old signature
std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(const Tensor& self, const Tensor& mean, const Tensor& invstd, const Tensor& running_mean,
                                                        const Tensor& running_var, double momentum, double epsilon, int64_t count)
// new signature
std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(const Tensor& self, const Tensor& mean, const Tensor& invstd, const Tensor& running_mean,
                                                        const Tensor& running_var, double momentum, double epsilon, const Tensor& counts)
```
+ change python interface
```python
class SyncBatchNorm(Function):
    def forward(self, input, weight, bias, running_mean, running_var, eps, momentum, process_group, world_size):
        ...
```
Pull Request resolved: pytorch#22248

Differential Revision: D16002146

Pulled By: mrshenli

fbshipit-source-id: 9007e83928267b89df4d3847aabfbdb63e456956
Summary:
Pull Request resolved: pytorch#22309

This diff enables PT operators to run with JIT mode. Users can control eager and JIT mode using the `use_jit` flag.

In this diff, we put the operators in a loop and pass it to JIT. An extra step that wraps the operator with the `_consume` op is introduced to avoid JIT's dead code elimination optimization. With that, the reported time includes the real operator execution time plus the `_consume` op (which directly returns its input; nothing else happens inside).

Reviewed By: zheng-xq

Differential Revision: D16033082

fbshipit-source-id: e03be89fd5a505e44e81015dfc63db9cd76fb8a1
Summary:
* Deletes all weak script decorators / associated data structures / methods
   * In order to keep supporting the standard library in script, this enables recursive script on any function defined in `torch.nn`
   * Most changes in `torch/nn` are the result of `ag -Q "weak" torch/nn/ -l | xargs sed -i '/weak/d'`, only `rnn.py` needed manual editing to use the `ignore` and `export` to continue supporting the overloaded `forward` methods
* `Sequential`/`ModuleList` no longer need to be added to constants since they are compiled on demand

This should also fix pytorch#22212
Pull Request resolved: pytorch#22212

Differential Revision: D15988346

Pulled By: driazati

fbshipit-source-id: af223e3ad0580be895377312949997a70e988e4f
… D16079809 (pytorch#22456)

Summary:
re-apply changes reverted in:
pytorch#22412

Also change log_softmax to take positional arguments. Long-term we do want the kwarg-only interface, but it currently seems to be incompatible with jit serialization.
Pull Request resolved: pytorch#22456

Differential Revision: D16097159

Pulled By: nairbv

fbshipit-source-id: 8cb73e9ca18fc66b35b873cf4a574b167a578b3d
Summary: Pull Request resolved: pytorch#22275

Test Plan: Imported from OSS

Differential Revision: D16060597

Pulled By: wanchaol

fbshipit-source-id: a11d8ad3b037e15bd670cc7cd3fefd4f0abd0bba
Summary: Pull Request resolved: pytorch#22509

Differential Revision: D16119307

Pulled By: Krovatkin

fbshipit-source-id: 3251f594be6d365652b847bdc35dd4f4b62c35e6
@pytorchbot pytorchbot added module: ci Related to continuous integration module: docs Related to our documentation, both in docs/ and docblocks module: infra Relates to CI infrastructure module: pybind Related to our Python bindings / interactions with other Python libraries labels Jul 4, 2019
@InnovArul
Contributor Author

I rebased my repo onto the latest pytorch to update this pull request, but it has messed up the history.
I will create another pull request instead and am therefore closing this one. Sorry.

@InnovArul InnovArul closed this Jul 4, 2019
facebook-github-bot pushed a commit that referenced this pull request Jul 8, 2019
#22540)

Summary:
the outcome from the pytorch forum issue: https://discuss.pytorch.org/t/dataloader-problem-problem-arises-when-shuffle-true/45631

The discussion is here: #20749
Pull Request resolved: #22540

Differential Revision: D16131777

Pulled By: ezyang

fbshipit-source-id: 566deda1b44dc7fae54250e9b508d120851a2848