-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Allow RPC to be initialized again after shutdown. #42723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow RPC to be initialized again after shutdown. #42723
Conversation
This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/) [ghstack-poisoned]
This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/) ghstack-source-id: 109412898 Pull Request resolved: #42723
💊 CI failures summary and remediationsAs of commit 1454298 (more details on the Dr. CI page):
❄️ 1 failure tentatively classified as flakybut reruns have not yet been triggered to confirm:
|
| TORCH_CHECK( | ||
| !container.initialized_, | ||
| "Container is already initialized! Cannot initialize it twice!"); | ||
| !container.initialized_ || (worker_id == container.worker_id_), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean if we shut down RPC, and then re-init, we can't change the passed in rank? I don't think there's a reason to do so, but just confirming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct. I think it is probably safer to not allow rank changes and its easier to also just reuse the same container.
rohan-varma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, had a couple of questions and there are some test failures.
| typeParser_ = std::make_shared<jit::ScriptTypeParser>( | ||
| std::make_shared<PythonTypeResolver>()); | ||
| void PythonRpcHandler::init() { | ||
| if (!initialized_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering what's the motivation for this atomic and if it's to prevent races? If there could be thread races, this atomic won't be enough to ensure initialization only once right? If there's no races, why do we need an atomic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function can probably be executed twice and that is safe and would happen rarely. Apart from that, the atomic is mostly to act as a memory barrier to ensure other threads are informed about this change. For example, if one thread calls cleanup and another thread calls getInstance after it, if we don't use an atomic the thread calling getInstance might still see the old value of initialized.
This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/) [ghstack-poisoned]
Pull Request resolved: #42723 This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. ghstack-source-id: 109478799 Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)
rohan-varma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/) [ghstack-poisoned]
Pull Request resolved: #42723 This PR is addressing #39340 and allows users to initialize RPC again after shutdown. Major changes in the PR include: 1. Change to DistAutogradContainer to support this. 2. Ensure PythonRpcHandler is reinitialized appropriately. 3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a different prefix. ghstack-source-id: 109805368 Differential Revision: [D22993909](https://our.internmc.facebook.com/intern/diff/D22993909/)
|
This pull request has been merged in 89b0b3b. |
|
|
||
| @dist_init(setup_rpc=False) | ||
| def test_init_rpc_twice(self): | ||
| initialize_pg(self.init_method, self.rank, self.world_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we missed testing this, though it looks like this test fails depending on the init_method. For example if users want to init RPC twice with TCPStore, it currently does not work (see details in #46491).
Stack from ghstack:
This PR is addressing #39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:
init_rpcuses adifferent prefix.
Differential Revision: D22993909