
[DCP] Add option to use PrefixStore to create checkpoint background process #166560

Closed
kevinmtang wants to merge 1 commit into pytorch:main from kevinmtang:export-D84928180

@kevinmtang kevinmtang commented Oct 29, 2025

Summary:
The DCP checkpoint background process currently picks the port for its process group (pg) via get_free_port().
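
For readers unfamiliar with this pattern, here is a minimal sketch of how a free-port helper commonly works. It is an illustration only: `find_free_port` is a hypothetical stand-in, not the actual `get_free_port()` implementation, and the `127.0.0.1` default is an assumption.

```python
import socket
from contextlib import closing


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        # Binding to port 0 lets the OS choose any free ephemeral port.
        sock.bind((host, 0))
        # The socket is closed when the block exits, so nothing reserves the
        # port between this call and the moment the caller binds it again.
        return sock.getsockname()[1]
```

Because the port is released before it is reused, another process could in principle claim it in between. Whether that is what causes the timeouts described here is not established by this PR.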

During checkpoint background process initialization, gloo pg init occasionally times out on the first call but succeeds in a subsequent call.

We hypothesized that the timeouts are related to the port being used, and the solution would be to create the pg with PrefixStore and reuse the master port.

This diff adds an option for the checkpoint background process to use PrefixStore with MASTER_ADDR and MASTER_PORT.

The default behavior is unchanged. Enabling the new PrefixStore behavior requires setting the "DCP_USE_PREFIX_STORE" environment variable to "1".
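
A rough sketch of what the opt-in path might look like, assuming the training job already hosts a TCPStore on MASTER_ADDR:MASTER_PORT. The function names, the `dcp_background/` prefix, and the timeout are illustrative assumptions and are not taken from the diff. Only `TCPStore`, `PrefixStore`, and `init_process_group` are real `torch.distributed` APIs.

```python
import os
from datetime import timedelta


def use_prefix_store() -> bool:
    """Return True only when DCP_USE_PREFIX_STORE is set to exactly "1"."""
    return os.environ.get("DCP_USE_PREFIX_STORE") == "1"


def build_prefix_store(prefix: str = "dcp_background/", timeout_s: int = 600):
    """Connect to the existing master store and namespace keys under `prefix`.

    `prefix` and `timeout_s` are illustrative values, not the ones in the PR.
    """
    # Imported lazily so the env-var check above works without torch installed.
    import torch.distributed as dist

    tcp_store = dist.TCPStore(
        host_name=os.environ["MASTER_ADDR"],
        port=int(os.environ["MASTER_PORT"]),
        is_master=False,  # assumption: the trainer already serves this store
        timeout=timedelta(seconds=timeout_s),
    )
    # PrefixStore prepends `prefix` to every key, so the background process
    # group does not collide with keys written by the main process group.
    return dist.PrefixStore(prefix, tcp_store)


def init_background_pg(rank: int, world_size: int) -> None:
    """Initialize a gloo process group, optionally via the shared master store."""
    import torch.distributed as dist

    if use_prefix_store():
        dist.init_process_group(
            backend="gloo",
            store=build_prefix_store(),
            rank=rank,
            world_size=world_size,
        )
    else:
        # Simplified default path: rendezvous through the usual env:// settings.
        dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
```

Under these assumptions, launching the job with `DCP_USE_PREFIX_STORE=1` would select the PrefixStore branch, and any other value (or an unset variable) keeps the default path.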

context:
https://fb.workplace.com/groups/319878845696681/permalink/1516883985996155/

Differential Revision: D84928180

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

pytorch-bot bot commented Oct 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166560

✅ No Failures

As of commit 0fdfe61 with merge base 104b868:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (checkpoint) labels Oct 29, 2025

meta-codesync bot commented Oct 29, 2025

@kevinmtang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84928180.

kevinmtang added a commit to kevinmtang/pytorch that referenced this pull request Oct 30, 2025
…rocess (pytorch#166560)

Test Plan:
```
buck test fbcode//mode/opt fbcode//caffe2/test/distributed/checkpoint:test_async_process_executor -- --run-disabled
```
https://www.internalfb.com/intern/testinfra/testrun/1407375340702140

https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-omnifm_0928_6kx_16-ETT-1348807470?job_attempt=1&version=0&tab=summary&env=PRODUCTION

Differential Revision: D84928180
kevinmtang added a commit to kevinmtang/pytorch that referenced this pull request Oct 30, 2025
…rocess (pytorch#166560)

Test Plan:
```
buck test fbcode//mode/opt fbcode//caffe2/test/distributed/checkpoint:test_async_process_executor -- --run-disabled
```
https://www.internalfb.com/intern/testinfra/testrun/1407375340702140

https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-omnifm_0928_6kx_16-ETT-1348807470?job_attempt=1&version=0&tab=summary&env=PRODUCTION

https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-omnifm_0928_6kx_256-ETT-28070b842e?version=0&env=PRODUCTION

Differential Revision: D84928180
kevinmtang added a commit to kevinmtang/pytorch that referenced this pull request Oct 30, 2025
…rocess (pytorch#166560)

@kevinmtang kevinmtang requested a review from meetv18 October 30, 2025 18:32

@meetv18 meetv18 left a comment

Thanks for adding support for this and improving reliability of process based async cp :)

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 3, 2025
…rocess (pytorch#166560)

Reviewed By: meetv18

Differential Revision: D84928180
@facebook-github-bot

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorch-bot bot pushed a commit that referenced this pull request Nov 4, 2025
…rocess (#166560)

Differential Revision: D84928180

Pull Request resolved: #166560
Approved by: https://github.com/meetv18

Labels

ciflow/trunk, fb-exported, Merged, meta-exported, oncall: distributed, release notes: distributed (checkpoint)

4 participants