
Conversation

bombs-kim (Contributor) commented Dec 13, 2018

Related PR

Fixed non-determinate preprocessing on DataLoader #4640

Content

I read the PR above and have always expected that determinism is ensured if I call manual_seed before creating a data loader. Recently I encountered a situation where that determinism seemed broken. It took me a while to figure out what went wrong, and I learned that the random number generation is conditioned on worker_id, and hence on the number of workers of a loader. This behavior is implicit and hard to anticipate, so I fixed it.

An example

import random
from datetime import datetime
import torch
from torch.utils.data import Dataset, DataLoader


class RandomDataset(Dataset):
    def __getitem__(self, item):
        # return torch.LongTensor([random.randint(0, 9999)])
        return torch.LongTensor(1).random_() % 10000

    def __len__(self):
        return 11


def train_load(dset, num_workers):
    # Simulate train loading
    torch.manual_seed(0)  # Fix seed
    loader = DataLoader(dset, num_workers=num_workers)
    print([data.item() for data in loader])


dset = RandomDataset()

for i in range(3):
    train_load(dset, 1)
    
train_load(dset, 2)
train_load(dset, 3)
train_load(dset, 4)
# previous (torch 1.0.0)
[8378, 3997, 9259, 9542, 4167, 9875, 2475, 4502, 2503, 2990, 5795]
[8378, 3997, 9259, 9542, 4167, 9875, 2475, 4502, 2503, 2990, 5795]
[8378, 3997, 9259, 9542, 4167, 9875, 2475, 4502, 2503, 2990, 5795]
[8378, 9306, 3997, 4917, 9259, 9434, 9542, 3405, 4167, 2219, 9875]
[8378, 9306, 2250, 3997, 4917, 3659, 9259, 9434, 716, 9542, 3405]
[8378, 9306, 2250, 5787, 3997, 4917, 3659, 2776, 9259, 9434, 716]
# this PR
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]
[8378, 9306, 2250, 5787, 5415, 2162, 2645, 4624, 1128, 9602, 7094]

@ssnl Please have a look at this PR

ssnl (Collaborator) left a comment

I don't think this is correct. You can't assume that indices are integers. You are not doing similar things for single process data loading. Seeding also isn't free, especially if the dataset uses GPU. I feel that the current behavior is reasonable, and what we should do is to make it clear in the docs.
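For context, the pattern the DataLoader docs point at is a worker_init_fn hook that derives per-worker seeds from torch.initial_seed(); a minimal sketch (illustrative only; it keeps a run reproducible for a fixed num_workers, it does not make results independent of the worker count):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # In each worker, torch.initial_seed() returns base_seed + worker_id,
    # so other libraries can be seeded consistently from it.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)

loader = DataLoader(dset, num_workers=4, worker_init_fn=seed_worker)  # dset from the example above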

bombs-kim (Contributor, Author) commented Dec 13, 2018

@ssnl Sorry, but why do you think I assumed that indices are integers? I couldn't find any part of my code that depends on the indices being integers. And I can address the "You are not doing similar things for single-process data loading" point, too.

bombs-kim (Contributor, Author) commented Dec 13, 2018

@ssnl Another big point of this PR is that, with this implementation, we no longer have to keep multiple index_queues. This relates to @apaszke's and @colesbury's concerns in PR #4640. The problem is that the benchmarking in that PR was too naive, and that approach does seem to hurt performance for me. I'm playing with it now and have found some cases where performance degrades non-trivially.

I was surprised that seeding is not so cheap, and I tried to verify it, but I couldn't observe any visible difference. We may need to compare the costs of the two scenarios. I plan to make subsequent commits that may demonstrate a performance boost, if you can help me with this PR.
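For reference, a minimal timing harness for that comparison could look like the sketch below (the worker counts and repeat count are illustrative, and dset is the RandomDataset from the example above; a realistically sized dataset would be needed for meaningful numbers):

import time
from torch.utils.data import DataLoader

def time_loader(dset, num_workers, repeats=5):
    # Average wall-clock time to iterate the whole dataset once.
    timings = []
    for _ in range(repeats):
        loader = DataLoader(dset, num_workers=num_workers)
        start = time.perf_counter()
        for _ in loader:
            pass
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

for n in (0, 1, 2, 4):
    print(n, time_loader(dset, n))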

I really appreciate your comment anyhow! Thanks

prev_getitem = dataset.__getitem__

def getitem(self, idx):
    random.seed(base_seed + idx)
ssnl (Collaborator) commented:

Here you are assuming that the index is an integer.

bombs-kim (Contributor, Author) replied Dec 14, 2018

Actually, idx doesn't have to be an integer. See the result of the code below

>>> random.seed(1.1); print(random.random())
0.22415427849366876
>>> random.seed(1.1); print(random.random())
0.22415427849366876
>>> random.seed(1.2); print(random.random())
0.24751702373640583

According to the official Python docs:

If a (the seed) is not None or an int or a long, then hash(a) is used instead

And here is the link to the implementation

But it does seem that torch.manual_seed requires an integer argument. I think you can use a hash function, like the builtin hash, in this case.

torch.manual_seed(hash(seed))
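For illustration, a sketch of how the patched getitem from the snippet above could handle non-integer indices this way (the tuple hash and the modulo that keeps the seed in torch.manual_seed's accepted range are additions for the sketch, not part of the PR):

import random
import torch

def getitem(self, idx):
    # hash() maps non-integer indices (floats, tuples, strings) to an int;
    # base_seed and prev_getitem are the names used in the snippet under review.
    sample_seed = hash((base_seed, idx)) % (2**63)
    random.seed(sample_seed)
    torch.manual_seed(sample_seed)
    return prev_getitem(idx)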

ssnl (Collaborator) commented Dec 14, 2018

The problem is that the benchmarking in that PR was too naive, and that approach does seem to hurt performance for me. I'm playing with it now and have found some cases where performance degrades non-trivially.

Could you come up with an example along with some benchmark numbers?

I was surprised that seeding is not so cheap,

Seeding is expensive if you are using CUDA.
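A rough way to see the difference (a sketch assuming a CUDA build; the loop count is arbitrary and the numbers will vary by machine):

import time
import torch

def avg_seed_time(n=1000):
    start = time.perf_counter()
    for i in range(n):
        torch.manual_seed(i)
    return (time.perf_counter() - start) / n

print(avg_seed_time())        # before any CUDA work is done
if torch.cuda.is_available():
    torch.cuda.init()         # force CUDA context creation
    print(avg_seed_time())    # now every CUDA generator is reseeded as well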

bombs-kim (Contributor, Author) commented

@ssnl After our conversation, and having learned that torch.manual_seed() is expensive with CUDA, I've decided that this PR is not worth it any more. It seems that manually seeding for each data point is not a good way to make a deterministically random dataset. Thank you, and I'm happy to close this PR.
