Skip to content

Conversation

@pritamdamania87
Copy link
Contributor

@pritamdamania87 pritamdamania87 commented Aug 7, 2020

Stack from ghstack:

This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

  1. Change to DistAutogradContainer to support this.
  2. Ensure PythonRpcHandler is reinitialized appropriately.
  3. Use PrefixStore in RPC initialization to ensure each new init_rpc uses a
    different prefix.

Differential Revision: D22993909

This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Aug 7, 2020
This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)

ghstack-source-id: 109412898
Pull Request resolved: #42723
@dr-ci
Copy link

dr-ci bot commented Aug 7, 2020

💊 CI failures summary and remediations

As of commit 1454298 (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test1 (1/1)

Step: "Download Docker image" (full log | diagnosis details | 🔁 rerun) ❄️

Fp2ts8glJWkyKdJ%2Fdcpiq3DQ6RoGbvt%2BtLdsuXvbVjFDT87vgHEtFbTYVMb7olX%2F5NFIYduDD649sXe%2Bpu4fXPR3QjrFdZG3R%2FQ2YinyrPUtXh9iGKOK5SoodAThNGyjLf4XpdnckgGvul9qI3jPE5LhH7%2F1iYZ3PHtbaA5UiZwlw2g%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200813T021149Z&X-Amz-SignedHeaders=host&X-Amz-Expires=3600&X-Amz-Credential=ASIAYTIFIPBEB35VPAXL%2F20200813%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=e679487418366d9f2f0b3a454928a73457349da9d1aef252f081ae88d0e3cdef: dial tcp 52.216.65.69:443: i/o timeout
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:a57b8c8191ee0d61aff57300eca0ae1f26c749bb-14542984039f09ddf816336fa65300d4cb13388c 
Parallel backend flags:  
p2ts8glJWkyKdJ%2Fdcpiq3DQ6RoGbvt%2BtLdsuXvbVjFDT87vgHEtFbTYVMb7olX%2F5NFIYduDD649sXe%2Bpu4fXPR3QjrFdZG3R%2FQ2YinyrPUtXh9iGKOK5SoodAThNGyjLf4XpdnckgGvul9qI3jPE5LhH7%2F1iYZ3PHtbaA5UiZwlw2g%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200813T021149Z&X-Amz-SignedHeaders=host&X-Amz-Expires=3600&X-Amz-Credential=ASIAYTIFIPBEB35VPAXL%2F20200813%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=e679487418366d9f2f0b3a454928a73457349da9d1aef252f081ae88d0e3cdef: dial tcp 52.216.65.69:443: i/o timeout 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 13 times.

TORCH_CHECK(
!container.initialized_,
"Container is already initialized! Cannot initialize it twice!");
!container.initialized_ || (worker_id == container.worker_id_),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean if we shut down RPC, and then re-init, we can't change the passed in rank? I don't think there's a reason to do so, but just confirming.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct. I think it is probably safer to not allow rank changes and its easier to also just reuse the same container.

Copy link
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, had a couple of questions and there are some test failures.

typeParser_ = std::make_shared<jit::ScriptTypeParser>(
std::make_shared<PythonTypeResolver>());
void PythonRpcHandler::init() {
if (!initialized_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering what's the motivation for this atomic and if it's to prevent races? If there could be thread races, this atomic won't be enough to ensure initialization only once right? If there's no races, why do we need an atomic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function can probably be executed twice and that is safe and would happen rarely. Apart from that, the atomic is mostly to act as a memory barrier to ensure other threads are informed about this change. For example, if one thread calls cleanup and another thread calls getInstance after it, if we don't use an atomic the thread calling getInstance might still see the old value of initialized.

This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Aug 7, 2020
Pull Request resolved: #42723

This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.
ghstack-source-id: 109478799

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)
Copy link
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Aug 13, 2020
Pull Request resolved: #42723

This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.
ghstack-source-id: 109805368

Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 89b0b3b.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/151/head branch August 17, 2020 14:16

@dist_init(setup_rpc=False)
def test_init_rpc_twice(self):
initialize_pg(self.init_method, self.rank, self.world_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we missed testing this, though it looks like this test fails depending on the init_method. For example if users want to init RPC twice with TCPStore, it currently does not work (see details in #46491).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants