Skip to content

Conversation

@ssnl
Copy link
Collaborator

@ssnl ssnl commented Jan 12, 2018

  1. Fixes signal.signal being called on child threads.

  2. Sometimes error in dataloader methods causes the trace to be polluted as in RuntimeError: DataLoader worker (pid 23616) is killed by signal: Terminated. #4507 . After investigating this with @ezyang , we realize the reason is that when an exception is thrown,

    1. sometimes worker processes are terminated before the Dataloader is deleted (likely because that the trace contains a reference to the Dataloader)

    2. then the workers got SIGTERM from loader before the clean up methods are called in Dataloader.__del__

    3. then loader's SIGCHLD handler is triggered, and it sees that workers fail, so it reports a RuntimeError with polluted trace as shown in RuntimeError: DataLoader worker (pid 23616) is killed by signal: Terminated. #4507 .

    As a solution, this PR suppresses SIGTERM from loader in worker SIGTERM handler, and exit(0) in such case.

  3. Improves some dataloder tests so they use a subclass of multiprocessing.Process that tracks the last error, and checks the error message.

  4. Prevents unnecessary cuda context initialization in Dataloader by torch.cuda.current_device() in case pin_memory=False.

  5. Always call _remove_worker_pids in Dataloader.shutdown so the set on cpp side in Dataloder.cpp is always updated correctly even when Dataloader.shutdown errors.

Fixes #4507 .

Verified that windows tests pass.

@ssnl
Copy link
Collaborator Author

ssnl commented Jan 12, 2018

@pytorchbot retest this please

@ssnl ssnl closed this Jan 12, 2018
@ssnl ssnl reopened this Jan 12, 2018
@ssnl ssnl force-pushed the dl branch 2 times, most recently from 45ef2e1 to 6637a67 Compare January 12, 2018 21:11
@ssnl
Copy link
Collaborator Author

ssnl commented Jan 12, 2018

uhh I'll find a better way to test

@ssnl ssnl force-pushed the dl branch 2 times, most recently from 869f4dc to 773dac6 Compare January 13, 2018 05:02
@ssnl
Copy link
Collaborator Author

ssnl commented Jan 13, 2018

Okay, here we go. This should work. Writing a test to work on all versions of Python was such a pain.

@ezyang
Copy link
Contributor

ezyang commented Jan 18, 2018

Whoo, so fancy!

# make SIGUSR1 interrupt
def handler(signum, frame):
pass
signal.signal(signal.SIGUSR1, handler)

This comment was marked as off-topic.

This comment was marked as off-topic.

static void handler_SIGTERM(int sig, siginfo_t *info, void *ctx)
{
if (info->si_pid == getppid()) {
_exit(EXIT_SUCCESS);

This comment was marked as off-topic.

This comment was marked as off-topic.

// processes first before deleting the loader object. Then the cleaning up
// methods in DataLoader.__del__ are not yet called, and SIGCHILD will print an
// error saying a worker is killed by SIGTERM. So we suppress SIGTERM from main
// loader process here to avoid this.

This comment was marked as off-topic.

This comment was marked as off-topic.

# Syscalls are automatically retried upon encountering EINTR since Python 3.5
# https://www.python.org/dev/peps/pep-0475/
# EINTR is not available on Windows.
if sys.platform == 'win32' or sys.version_info >= (3, 5):

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

p.send_signal(signal.SIGUSR1)
p.join(0.5)
self.assertTrue(p.is_alive())
p.join(JOIN_TIMEOUT)

This comment was marked as off-topic.

This comment was marked as off-topic.

finally:
p.terminate()
# Case 2: timeout
timeout = 2

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

p = ErrorTrackingProcess(target=_test_segfault)
p.start()
p.join(JOIN_TIMEOUT)
p.join(3.0 + JOIN_TIMEOUT)

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

# The used socket.recv call below in the replacing function is quite
# dangerous because it blocks everything on Python side, including the
# cleaning up in dataloder.__del__ when this process exits. To prevent
# orphan worker child, we manually terminate worker process here.

This comment was marked as off-topic.

This comment was marked as off-topic.

@ssnl
Copy link
Collaborator Author

ssnl commented Jan 19, 2018

Addressed comments.

@ssnl
Copy link
Collaborator Author

ssnl commented Jan 24, 2018

@apaszke Could you review the new changes when you have time please? Thanks :)

@apaszke
Copy link
Contributor

apaszke commented Jan 24, 2018

@ssnl sorry will try tomorrow

data_queue = dataloaderiter.data_queue
data = data_queue.get() # ensure that worker handlers are installed
for w in dataloaderiter.workers:
w.terminate()

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

p.join(0.5)
self.assertTrue(p.is_alive())
except OSError as e:
self.fail("DataLoader shouldn't fail due to interrupted syscall")

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

if timeout is not None:
timeout -= time_fn() - t0
if timeout <= 0:
raise queue.Empty

This comment was marked as off-topic.

def put(self, val):
while True:
try:
return self.queue.put(val)

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

data_queue.queue.get = interruptable_get
except Exception:
# in Python >= 3.5 or on Windows QueueWrapper is not a class but a no-op
pass

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

@ssnl ssnl changed the title Retry queue.put/get upon EINTR and fix dataloader trace Dataloader issues Jan 25, 2018
@soumith soumith merged commit 64a9eca into pytorch:master Jan 29, 2018
@ssnl ssnl deleted the dl branch January 29, 2018 05:48
@soumith soumith added the 0.3.1 label Feb 7, 2018
ssnl added a commit to ssnl/pytorch that referenced this pull request Feb 8, 2018
soumith pushed a commit that referenced this pull request Feb 9, 2018
* cherry pick Fix multiprocessing and dataloader tests on Windows (#4453)

* cherry pick Dataloader issues #4643

* fix common IS_WINDOWS
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RuntimeError: DataLoader worker (pid 23616) is killed by signal: Terminated.

4 participants