[DCP] Add option to use PrefixStore to create checkpoint background process#166560
kevinmtang wants to merge 1 commit into pytorch:main
@kevinmtang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84928180.
meetv18 left a comment:
Thanks for adding support for this and improving reliability of process based async cp :)
…rocess (pytorch#166560)

Summary:
DCP checkpoint background process currently determines the port used for pg via get_free_port().
During checkpoint background process initialization, gloo pg init occasionally times out on the first call but succeeds in a subsequent call.
We hypothesized that the timeouts are related to the port being used, and the solution would be to create the pg with PrefixStore and reuse the master port.
This diff adds the option for checkpoint background process to use PrefixStore with MASTER_ADDR + MASTER_PORT.
The default behavior is unchanged. Enabling the new PrefixStore behavior requires setting "DCP_USE_PREFIX_STORE" env var to "1".

context:
https://fb.workplace.com/groups/319878845696681/permalink/1516883985996155/

Test Plan:
```
buck test fbcode//mode/opt fbcode//caffe2/test/distributed/checkpoint:test_async_process_executor -- --run-disabled
```
https://www.internalfb.com/intern/testinfra/testrun/1407375340702140
https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-omnifm_0928_6kx_16-ETT-1348807470?job_attempt=1&version=0&tab=summary&env=PRODUCTION
https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-omnifm_0928_6kx_256-ETT-28070b842e?version=0&env=PRODUCTION

Reviewed By: meetv18

Differential Revision: D84928180
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
…rocess (#166560)
Pull Request resolved: #166560
Approved by: https://github.com/meetv18
Summary:
The DCP checkpoint background process currently picks the port for its process group (pg) via get_free_port().
During initialization of the checkpoint background process, gloo pg init occasionally times out on the first call but succeeds on a subsequent call.
We hypothesize that the timeouts are related to the port being used, and that creating the pg with a PrefixStore that reuses the master port would avoid them.
This diff adds an option for the checkpoint background process to use a PrefixStore with MASTER_ADDR + MASTER_PORT.
The default behavior is unchanged. To enable the new PrefixStore behavior, set the "DCP_USE_PREFIX_STORE" env var to "1".
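
For readers unfamiliar with the two rendezvous paths, here is a minimal sketch of the opt-in described above. It is not the code in this diff. The helper names `find_free_port`, `use_prefix_store`, and `init_background_pg`, the `dcp_bg` prefix, and the way the fallback port reaches each rank are assumptions made for illustration; `TCPStore`, `PrefixStore`, and `init_process_group` are the public `torch.distributed` APIs. The sketch also assumes a store is already listening at MASTER_ADDR:MASTER_PORT (for example, one started by the job launcher).

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (a stand-in for get_free_port()).

    The port is released when the socket closes, so another process could
    claim it before the process group binds to it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


def use_prefix_store(env=None) -> bool:
    """Return True only when DCP_USE_PREFIX_STORE is set to exactly "1"."""
    env = os.environ if env is None else env
    return env.get("DCP_USE_PREFIX_STORE") == "1"


def init_background_pg(rank, world_size, fallback_port=None, prefix="dcp_bg"):
    """Hypothetical helper: create the gloo pg for the checkpoint process."""
    # Imported lazily so the helpers above can be used without torch.
    import torch.distributed as dist

    addr = os.environ["MASTER_ADDR"]
    if use_prefix_store():
        # Connect as a client to the store already serving the master port,
        # then namespace our keys so they cannot collide with other users.
        base = dist.TCPStore(
            host_name=addr,
            port=int(os.environ["MASTER_PORT"]),
            is_master=False,
        )
        store = dist.PrefixStore(prefix, base)
        dist.init_process_group(
            backend="gloo", store=store, rank=rank, world_size=world_size
        )
    else:
        # Default path (assumption): every rank was given the same port,
        # chosen beforehand with something like find_free_port().
        dist.init_process_group(
            backend="gloo",
            init_method=f"tcp://{addr}:{fallback_port}",
            rank=rank,
            world_size=world_size,
        )
```

With the flag unset, `use_prefix_store()` returns False and the sketch falls back to the separately chosen port, which mirrors the "default behavior is unchanged" statement above.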
context:
https://fb.workplace.com/groups/319878845696681/permalink/1516883985996155/
Differential Revision: D84928180
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci