
Conversation

bombs-kim (Contributor) commented Dec 13, 2018

Related PR

Fixed non-determinate preprocessing on DataLoader #4640

Content

I read the PR above and have always expected that determinism is ensured if I call manual_seed before creating a data loader. Recently I encountered a situation where that determinism seemed broken. It took me a while to figure out what went wrong, and I learned that the random number generation is conditioned on worker_id, and hence on the number of workers of a loader. This behavior is implicit and hard to anticipate, so I fixed it.

An example

import random
from datetime import datetime
import torch
from torch.utils.data import Dataset, DataLoader


class RandomDataset(Dataset):
    def __getitem__(self, item):
        # return torch.LongTensor([random.randint(0, 9999)])
        return torch.LongTensor(1).random_() % 10000

    def __len__(self):
        return 11


def train_load(dset, num_workers):
    # Simulate train loading
    torch.manual_seed(0)  # Fix seed
    loader = DataLoader(dset, num_workers=num_workers)
    print([data.item() for data in loader])


dset = RandomDataset()

for i in range(3):
    train_load(dset, 1)
    
train_load(dset, 2)
train_load(dset, 3)
train_load(dset, 4)
# previous (torch 1.0.0)
[8378, 3997, 9259, 9542, 4167, 9875, 2475, 4502, 2503, 2990, 5795]
[8378, 3997, 9259, 9542, 4167, 9875, 2475, 4502, 2503, 2990, 5795]
[8378, 3997, 9259, 9542, 4167, 9875, 2475, 4502, 2503, 2990, 5795]
[8378, 9306, 3997, 4917, 9259, 9434, 9542, 3405, 4167, 2219, 9875]
[8378, 9306, 2250, 3997, 4917, 3659, 9259, 9434, 716, 9542, 3405]
[8378, 9306, 2250, 5787, 3997, 4917, 3659, 2776, 9259, 9434, 716]
# this PR
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]

@ssnl Please have a look at this PR

ssnl (Collaborator) left a comment

I don't think this is correct. You can't assume that indices are integers. You are not doing similar things for single process data loading. Seeding also isn't free, especially if the dataset uses GPU. I feel that the current behavior is reasonable, and what we should do is to make it clear in the docs.
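For context, the pattern the DataLoader docs point at is a worker_init_fn hook that derives per-worker seeds from torch.initial_seed(); a minimal sketch (illustrative only; it keeps a run reproducible for a fixed num_workers, it does not make results independent of the worker count):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # In each worker, torch.initial_seed() returns base_seed + worker_id,
    # so other libraries can be seeded consistently from it.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)

loader = DataLoader(dset, num_workers=4, worker_init_fn=seed_worker)  # dset from the example above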

bombs-kim (Contributor, Author) commented Dec 13, 2018

@ssnl Sorry, but why do you think I assumed that indices are integers? I couldn't find any part of my code that depends on the indices being integers. And I can address the "You are not doing similar things for single-process data loading" point, too.

bombs-kim (Contributor, Author) commented Dec 13, 2018

@ssnl Another big point of this PR is that, with this implementation, we no longer have to keep multiple index_queues. This relates to @apaszke's and @colesbury's concerns in PR #4640. The problem is that the benchmarking in that PR was too naive, and that approach does seem to hurt performance for me. I'm playing with it now and have found some cases where performance degrades non-trivially.

I was surprised that seeding is not so cheap, and I tried to verify it, but I couldn't observe any visible difference. We may need to compare the costs of the two scenarios. I plan to make subsequent commits that may demonstrate a performance boost, if you can help me with this PR.
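For reference, a minimal timing harness for that comparison could look like the sketch below (the worker counts and repeat count are illustrative, and dset is the RandomDataset from the example above; a realistically sized dataset would be needed for meaningful numbers):

import time
from torch.utils.data import DataLoader

def time_loader(dset, num_workers, repeats=5):
    # Average wall-clock time to iterate the whole dataset once.
    timings = []
    for _ in range(repeats):
        loader = DataLoader(dset, num_workers=num_workers)
        start = time.perf_counter()
        for _ in loader:
            pass
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

for n in (0, 1, 2, 4):
    print(n, time_loader(dset, n))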

I really appreciate your comment anyhow! Thanks

prev_getitem = dataset.__getitem__

def getitem(self, idx):
    random.seed(base_seed + idx)
ssnl (Collaborator) commented:

Here you are assuming that the index is an integer.

bombs-kim (Contributor, Author) replied Dec 14, 2018

Actually, idx doesn't have to be an integer. See the result of the code below

>>> random.seed(1.1); print(random.random())
0.22415427849366876
>>> random.seed(1.1); print(random.random())
0.22415427849366876
>>> random.seed(1.2); print(random.random())
0.24751702373640583

According to the official Python docs:

If a (the seed) is not None or an int or a long, then hash(a) is used instead

And here is the link to the implementation

But it does seem that torch.manual_seed requires an integer argument. I think you can use a hash function, like the builtin hash, in this case.

torch.manual_seed(hash(seed))
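For illustration, a sketch of how the patched getitem from the snippet above could handle non-integer indices this way (the tuple hash and the modulo that keeps the seed in torch.manual_seed's accepted range are additions for the sketch, not part of the PR):

import random
import torch

def getitem(self, idx):
    # hash() maps non-integer indices (floats, tuples, strings) to an int;
    # base_seed and prev_getitem are the names used in the snippet under review.
    sample_seed = hash((base_seed, idx)) % (2**63)
    random.seed(sample_seed)
    torch.manual_seed(sample_seed)
    return prev_getitem(idx)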

ssnl (Collaborator) commented Dec 14, 2018

The problem is that the benchmarking in that PR was too naive, and that approach does seem to hurt performance for me. I'm playing with it now and have found some cases where performance degrades non-trivially.

Could you come up with an example along with some benchmark numbers?

I was surprised that seeding is not so cheap,

Seeding is expensive if you are using CUDA.
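A rough way to see the difference (a sketch assuming a CUDA build; the loop count is arbitrary and the numbers will vary by machine):

import time
import torch

def avg_seed_time(n=1000):
    start = time.perf_counter()
    for i in range(n):
        torch.manual_seed(i)
    return (time.perf_counter() - start) / n

print(avg_seed_time())        # before any CUDA work is done
if torch.cuda.is_available():
    torch.cuda.init()         # force CUDA context creation
    print(avg_seed_time())    # now every CUDA generator is reseeded as well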

bombs-kim (Contributor, Author) commented

@ssnl After our conversation, and having learned that torch.manual_seed() is expensive with CUDA, I've decided that this PR is not worth it any more. It seems that manually seeding for each data point is not a good way to make a deterministically random dataset. Thank you, and I'm happy to close this PR.
