Skip to content

Conversation

@ssnl
Copy link
Collaborator

@ssnl ssnl commented Sep 9, 2018

Process.start() actually take some time as it needs to start a
process and pass the arguments over via a pipe. Therefore, we
only add a worker to self.workers list after it started, so
that we do not call .join() if program dies before it starts,
and __del__ tries to join it but will get:
AssertionError: can only join a started process.

Example trace when such error happens:

[unrelated]
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 500, in __iter__
    return _DataLoaderIter(self)
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 292, in __init__
    w.start()
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
KeyboardInterrupt
Exception ignored in: <function _DataLoaderIter.__del__ at 0x7fa704d5aa60>
Traceback (most recent call last):
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 398, in __del__
    self._shutdown_workers()
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 392, in _shutdown_workers
    w.join()
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 139, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process

No test because hard to reliably trigger.

@ssnl ssnl changed the title Only joint started dataloader workers Only join started dataloader workers Sep 9, 2018
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@ezyang ezyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl ssnl deleted the join_after_start branch September 9, 2018 20:25
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary:
`Process.start()` actually take some time as it needs to start a
process and pass the arguments over via a pipe. Therefore, we
only add a worker to self.workers list after it started, so
that we do not call `.join()` if program dies before it starts,
and `__del__` tries to join it but will get:
    AssertionError: can only join a started process.

Example trace when such error happens:
```py
[unrelated]
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 500, in __iter__
    return _DataLoaderIter(self)
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 292, in __init__
    w.start()
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
KeyboardInterrupt
Exception ignored in: <function _DataLoaderIter.__del__ at 0x7fa704d5aa60>
Traceback (most recent call last):
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 398, in __del__
    self._shutdown_workers()
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 392, in _shutdown_workers
    w.join()
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 139, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
```

No test because hard to reliably trigger.
Pull Request resolved: pytorch#11432

Reviewed By: ezyang

Differential Revision: D9735430

Pulled By: SsnL

fbshipit-source-id: a8912d9bb4063f210d6236267b178173810e2351
facebook-github-bot pushed a commit that referenced this pull request Sep 13, 2018
Summary:
Same reason as in #11432 .

Example error:
```
Exception ignored in: <function _DataLoaderIter.__del__ at 0x7fa06963cf28>
Traceback (most recent call last):
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 405, in __del__
    self._shutdown_workers()
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 401, in _shutdown_workers
    self.pin_memory_thread.join()
AttributeError: '_DataLoaderIter' object has no attribute 'pin_memory_thread'
```
Pull Request resolved: #11599

Differential Revision: D9801143

Pulled By: SsnL

fbshipit-source-id: 520590a21f56fa381fcac621457a7544d3fba47e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants