pietern commented Nov 2, 2018

This helper addresses a common pattern where one spawns N processes to
work on some common task (e.g. parallel preprocessing or multiple
training loops).

A straightforward approach is to use the multiprocessing API directly
and then call join on each of the resulting processes in turn.
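
For illustration, the naive pattern might look like the sketch below. The names `worker` and `run_naive` are hypothetical and not part of this PR, and the "fork" start method is an assumption made to keep the example short (it is POSIX only).

```python
import multiprocessing as mp
import time


def worker(rank, fail_rank, duration):
    # Hypothetical worker: one rank fails right away, the others stay busy.
    if rank == fail_rank:
        raise RuntimeError("rank %d failed" % rank)
    time.sleep(duration)


def run_naive(nprocs, fail_rank, duration):
    # Assumption: the "fork" start method (POSIX only) keeps the sketch short.
    ctx = mp.get_context("fork")
    procs = [ctx.Process(target=worker, args=(rank, fail_rank, duration))
             for rank in range(nprocs)]
    for proc in procs:
        proc.start()
    # Joining in launch order: the parent blocks on rank 0 and cannot react
    # to a failure in a later rank until every earlier rank has finished.
    for proc in procs:
        proc.join()
    return [proc.exitcode for proc in procs]
```

With `run_naive(2, 1, 0.5)`, rank 1 fails immediately, yet the parent only sees its exit code after rank 0 has slept for the full duration.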

This pattern breaks down in the face of errors. If one of the
processes terminates with an exception or via a signal, and it is
not the first process that was launched, the parent stays blocked in
the join call on the first process and does not notice the failure.
This helper solves this by waiting for any of the spawned processes
to terminate. When any process terminates with a non-zero exit
status, the helper terminates the remaining processes and raises an
exception in the parent process. If the process terminated with an
exception, it is propagated to the parent. If the process terminated
via a signal (e.g. SIGINT, SIGSEGV), this is mentioned in the
exception as well.
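
The behavior described above can be sketched with the standard library alone. This is not the code in this PR: `run_and_watch` and `_wrap` are hypothetical names, the "fork" start method is an assumption (POSIX only), and the sketch uses newer Python features than the 3.4 minimum. It waits on the process sentinels with `multiprocessing.connection.wait`, so a failure in any rank is seen as soon as it happens.

```python
import multiprocessing as mp
import signal
import time
import traceback
from multiprocessing.connection import wait


def _wrap(fn, rank, error_queue):
    # Run fn(rank); on failure, send the traceback text to the parent.
    try:
        fn(rank)
    except Exception:
        error_queue.put(traceback.format_exc())
        raise SystemExit(1)


def run_and_watch(fn, nprocs):
    # Assumption: "fork" start method (POSIX only), to keep the sketch short.
    ctx = mp.get_context("fork")
    queues = [ctx.SimpleQueue() for _ in range(nprocs)]
    procs = [ctx.Process(target=_wrap, args=(fn, rank, queues[rank]))
             for rank in range(nprocs)]
    for proc in procs:
        proc.start()
    pending = {proc.sentinel: rank for rank, proc in enumerate(procs)}
    while pending:
        # Wake up as soon as any process exits, not just the first one.
        for sentinel in wait(list(pending)):
            rank = pending.pop(sentinel)
            failed = procs[rank]
            failed.join()
            if failed.exitcode == 0:
                continue
            # One process failed: stop the others before raising.
            for other in procs:
                if other.is_alive():
                    other.terminate()
            for other in procs:
                other.join()
            if failed.exitcode < 0:
                name = signal.Signals(-failed.exitcode).name
                raise RuntimeError(
                    "process %d terminated by signal %s" % (rank, name))
            detail = "" if queues[rank].empty() else queues[rank].get()
            raise RuntimeError(
                "process %d exited with code %d\n%s"
                % (rank, failed.exitcode, detail))
```

Passing the traceback as text through a queue is one simple way to surface the child's exception in the parent; the parent raises a new `RuntimeError` rather than the original exception object.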

Requires Python >= 3.4.

pietern commented Nov 2, 2018

Tested on Linux. Expect this to work on macOS. Unsure about Windows.

teng-li commented Nov 2, 2018

My 2 cents: I would prefer this to live under torch.distributed.

facebook-github-bot left a comment

@pietern has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

colesbury left a comment

lgtm

You can build and preview docs via `cd docs; make singlehtml` and then use your favorite method to serve the HTML.

(Here's a script that works with IPv6 only machines: https://gist.github.com/colesbury/b8cf3f8a2346821fdacb59ab981d4456)
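
As one possible way to serve the built HTML, here is a minimal sketch using Python's standard `http.server` (Python 3.7 or newer). The helper name `make_docs_server` is hypothetical, and the directory argument is whatever folder `make singlehtml` writes to in your checkout; that path is not stated in this thread. This sketch binds to IPv4 loopback only and is not the script from the gist above.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_docs_server(directory, host="127.0.0.1", port=0):
    # Serve static files from `directory`; port 0 lets the OS pick a free port.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)


def serve_in_background(server):
    # Run the server on a daemon thread so the caller keeps control.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread
```

After creating the server, `server.server_address[1]` gives the chosen port, and `server.shutdown()` stops it.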

pietern force-pushed the multiprocessing-spawn branch from 021d13e to 54e56dd on November 6, 2018 06:41
pietern commented Nov 6, 2018

Thanks for the hints on the docs, @colesbury. Since the code here is blocking some folks, I will merge this first and address the documentation tomorrow. Then we can iterate on the wording without a rush.

pietern force-pushed the multiprocessing-spawn branch from 54e56dd to 13c9207 on November 6, 2018 14:11
pietern force-pushed the multiprocessing-spawn branch from 13c9207 to 33e98fc on November 6, 2018 16:45
pietern commented Nov 6, 2018

Updates to fix the ROCm and Windows builds.

pietern deleted the multiprocessing-spawn branch on November 6, 2018 22:26
ezyang added the merged label on Jun 25, 2019