-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Dataloader issues #4643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataloader issues #4643
Conversation
|
@pytorchbot retest this please |
45ef2e1 to
6637a67
Compare
|
uhh I'll find a better way to test |
869f4dc to
773dac6
Compare
|
Okay, here we go. This should work. Writing a test to work on all versions of Python was such a pain. |
|
Whoo, so fancy! |
test/test_dataloader.py
Outdated
| # make SIGUSR1 interrupt | ||
| def handler(signum, frame): | ||
| pass | ||
| signal.signal(signal.SIGUSR1, handler) |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| static void handler_SIGTERM(int sig, siginfo_t *info, void *ctx) | ||
| { | ||
| if (info->si_pid == getppid()) { | ||
| _exit(EXIT_SUCCESS); |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/DataLoader.cpp
Outdated
| // processes first before deleting the loader object. Then the cleaning up | ||
| // methods in DataLoader.__del__ are not yet called, and SIGCHILD will print an | ||
| // error saying a worker is killed by SIGTERM. So we suppress SIGTERM from main | ||
| // loader process here to avoid this. |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
| # Syscalls are automatically retried upon encountering EINTR since Python 3.5 | ||
| # https://www.python.org/dev/peps/pep-0475/ | ||
| # EINTR is not available on Windows. | ||
| if sys.platform == 'win32' or sys.version_info >= (3, 5): |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
test/test_dataloader.py
Outdated
| p.send_signal(signal.SIGUSR1) | ||
| p.join(0.5) | ||
| self.assertTrue(p.is_alive()) | ||
| p.join(JOIN_TIMEOUT) |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
test/test_dataloader.py
Outdated
| finally: | ||
| p.terminate() | ||
| # Case 2: timeout | ||
| timeout = 2 |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
test/test_dataloader.py
Outdated
| p = ErrorTrackingProcess(target=_test_segfault) | ||
| p.start() | ||
| p.join(JOIN_TIMEOUT) | ||
| p.join(3.0 + JOIN_TIMEOUT) |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
test/test_dataloader.py
Outdated
| # The used socket.recv call below in the replacing function is quite | ||
| # dangerous because it blocks everything on Python side, including the | ||
| # cleaning up in dataloder.__del__ when this process exits. To prevent | ||
| # orphan worker child, we manually terminate worker process here. |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
Addressed comments. |
|
@apaszke Could you review the new changes when you have time please? Thanks :) |
|
@ssnl sorry will try tomorrow |
test/test_dataloader.py
Outdated
| data_queue = dataloaderiter.data_queue | ||
| data = data_queue.get() # ensure that worker handlers are installed | ||
| for w in dataloaderiter.workers: | ||
| w.terminate() |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
test/test_dataloader.py
Outdated
| p.join(0.5) | ||
| self.assertTrue(p.is_alive()) | ||
| except OSError as e: | ||
| self.fail("DataLoader shouldn't fail due to interrupted syscall") |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/utils.py
Outdated
| if timeout is not None: | ||
| timeout -= time_fn() - t0 | ||
| if timeout <= 0: | ||
| raise queue.Empty |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/utils.py
Outdated
| def put(self, val): | ||
| while True: | ||
| try: | ||
| return self.queue.put(val) |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
test/test_dataloader.py
Outdated
| data_queue.queue.get = interruptable_get | ||
| except Exception: | ||
| # in Python >= 3.5 or on Windows QueueWrapper is not a class but a no-op | ||
| pass |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
Fixes
signal.signalbeing called on child threads.Sometimes error in dataloader methods causes the trace to be polluted as in RuntimeError: DataLoader worker (pid 23616) is killed by signal: Terminated. #4507 . After investigating this with @ezyang , we realize the reason is that when an exception is thrown,
sometimes worker processes are terminated before the
Dataloaderis deleted (likely because that the trace contains a reference to theDataloader)then the workers got SIGTERM from loader before the clean up methods are called in
Dataloader.__del__then loader's SIGCHLD handler is triggered, and it sees that workers fail, so it reports a RuntimeError with polluted trace as shown in RuntimeError: DataLoader worker (pid 23616) is killed by signal: Terminated. #4507 .
As a solution, this PR suppresses SIGTERM from loader in worker SIGTERM handler, and
exit(0)in such case.Improves some dataloder tests so they use a subclass of
multiprocessing.Processthat tracks the last error, and checks the error message.Prevents unnecessary cuda context initialization in
Dataloaderbytorch.cuda.current_device()in casepin_memory=False.Always call
_remove_worker_pidsinDataloader.shutdownso the set on cpp side inDataloder.cppis always updated correctly even whenDataloader.shutdownerrors.Fixes #4507 .
Verified that windows tests pass.