-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Prevent hanging in data loader altogether #11985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @csarofeen I remember that you said there is still occasional hang. This might fix it. |
|
Thanks for keeping me posted, I'm checking now to see if I can still get any hangs. I am getting quite a few errors on cleanup (disclaimer this is on a ~month old master + most of your changes, so take with a grain of salt for the moment). |
|
Besides the error, I am seeing one of the big hangs I was fighting with is gone. Thanks @ssnl! Will also check on a newer master commit. |
|
@csarofeen Thanks for testing out. I am aware of that error. I think it is due to that during python exit, when the main process tries to get from worker in There is another hang related to this. I think it is due to that we set |
|
I'm getting a test failure now... Sorry, I see this is already showing up in the CI. |
|
@csarofeen Yep, I will fix that. There are still some tricky parts to be fixed. |
e08b452 to
3771ee4
Compare
|
I firmly believe that this should work in all cases. And empirically it has even lowered the data loading shutdown overhead a bit for me when training LeNets on MNIST. I'm training 9000 LeNets on multiple processes to see if any still hangs. (Previously I can't train 200 without seeing hanging on one rank). |
|
@pytorchbot retest this please |
|
@pytorchbot retest this please |
1 similar comment
|
@pytorchbot retest this please |
test/test_dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
apaszke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't finished a review, but I have a bunch of comments and need to board a plane now.
test/test_dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
test/test_dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SsnL is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
|
@ssnl I'm sorry but NOTE [ Data Loader Multiprocessing Shutdown Logic ] that you wrote is somewhat confusing to me... Firstly, did you mean this process.py code in CPython, when you said this?
And if that's the case, is it really related to Python 3.7 or later? Because I can see the similar code in previous versions. Thank you in advance! |
|
@bombs-kim Yes I mean that code. It is not specific to python 3.7 or later. |
Test plan:
Trained 16000 LeNets (total) on MNIST on 160 processes. Previously some of them will either hang during training or during program exiting. Now every process finishes successfully.
Original desc:
In
DataLoaderIter__del__, ensure thatNoneis sent topin_memory_threadbefore joining workers.Trace when interrupted at such a hang:
The 1st commit solves majority of the hang, but uncovers another problem: