Determinism breaks when the number of workers of a DataLoader changes #15170
Conversation
ssnl left a comment:
I don't think this is correct. You can't assume that indices are integers. You are not doing similar things for single process data loading. Seeding also isn't free, especially if the dataset uses GPU. I feel that the current behavior is reasonable, and what we should do is to make it clear in the docs.
@ssnl Sorry, but why do you think I assumed that indices are integers? I couldn't find any part of my code that depends on the indices being integers. And I can fix the "You are not doing similar things for single-process data loading" issue, too.
@ssnl Another big point of this PR is that, with this implementation, we don't have to keep multiple

I was surprised by the fact that seeding is not so cheap, and tried to confirm it, but I failed to observe any visible difference. We may need to compare the costs of the two scenarios. I plan to make subsequent commits that may demonstrate a performance boost, if you can help me with this PR. I really appreciate your comment anyhow! Thanks
```python
prev_getitem = dataset.__getitem__

def getitem(self, idx):
    random.seed(base_seed + idx)
```
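The idea being reviewed, as a minimal self-contained sketch (the dataset, the default base_seed, and the returned values here are hypothetical, not the PR's actual code):

```python
import random

from torch.utils.data import Dataset


class PerIndexSeededDataset(Dataset):
    """Illustrative only: seed from base_seed + idx so a sample's random
    numbers depend on its index alone, not on which worker serves it."""

    def __init__(self, base_seed=0):
        self.base_seed = base_seed

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        random.seed(self.base_seed + idx)  # assumes idx supports + with an int
        return idx, random.random()
```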
Here you are assuming that index is an integer.
Actually, idx doesn't have to be an integer. See the result of the code below:

```python
>>> random.seed(1.1); print(random.random())
0.22415427849366876
>>> random.seed(1.1); print(random.random())
0.22415427849366876
>>> random.seed(1.2); print(random.random())
0.24751702373640583
```

According to the official Python docs:
"If a (the seed) is not None or an int or a long, then hash(a) is used instead."
And here is the link to the implementation
But it does seem that torch.manual_seed requires an integer argument. I think you can use some hash function like the builtin hash in this case:

```python
torch.manual_seed(hash(seed))
```
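For instance (a hypothetical non-integer index; note that hash() of str/bytes values is randomized per process unless PYTHONHASHSEED is fixed, which would itself hurt reproducibility across runs):

```python
import torch

idx = (3, 1.5)                 # hypothetical non-integer index
torch.manual_seed(hash(idx))   # hash() maps it to an int that manual_seed accepts
print(torch.rand(1))           # numeric tuples hash deterministically, so this repeats across runs
```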
Could you come up with an example along with some benchmark numbers?
Seeding is expensive if you are using CUDA.
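One rough way to check that, assuming torch.manual_seed is the call being measured (the call count is arbitrary and the numbers will vary by machine):

```python
import timeit

import torch

# Re-seeding only the CPU generator.
cpu = timeit.timeit(lambda: torch.manual_seed(0), number=1000)
print(f"CPU only: {cpu * 1e3:.2f} ms / 1000 calls")

# Once CUDA is initialized, manual_seed also re-seeds every CUDA generator.
if torch.cuda.is_available():
    torch.cuda.init()
    cuda = timeit.timeit(lambda: torch.manual_seed(0), number=1000)
    print(f"CUDA initialized: {cuda * 1e3:.2f} ms / 1000 calls")
```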
@ssnl After I had a conversation with you and learned that torch.manual_seed() is expensive with CUDA, I decided that this PR is not worth it any more. It seems that manual seeding for each datapoint is not a good way to make a deterministically random dataset. Thank you, and I'm happy to close this PR.
Related PR
Fixed non-determinate preprocessing on DataLoader #4640
Content
I read the PR above and had always expected that determinism is ensured if I call manual_seed before making a data loader. Recently I encountered a situation where the determinism seemed broken. It took me a while to figure out what went wrong, and I learned that the random number generation was conditioned on worker_id, and hence on the number of workers of a loader. I thought this was implicit and hard to expect, so I fixed it.

An example
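A minimal sketch of the kind of setup where this shows up (the dataset, its length, and the printed values are hypothetical):

```python
import random

import torch
from torch.utils.data import DataLoader, Dataset


class NoisyDataset(Dataset):
    """Hypothetical dataset whose __getitem__ draws from Python's RNG,
    standing in for random data augmentation."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return idx, random.random()


if __name__ == "__main__":
    for num_workers in (1, 2):
        torch.manual_seed(0)  # same base seed before building each loader
        loader = DataLoader(NoisyDataset(), num_workers=num_workers)
        values = [round(noise.item(), 4) for _, noise in loader]
        print(num_workers, values)
    # Each worker is seeded from base_seed + worker_id, so the per-sample
    # values generally differ between the two runs even though the base
    # seed is identical.
```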
@ssnl Please have a look at this PR