
Conversation

Contributor

@wz337 wz337 commented Nov 21, 2022

This PR moves dedup_tensors and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

This removes duplicated shards in a list of SavePlan objects. It is used when saving a DTensor with replicated placement.

Docstring and comments will be added in the following PRs.
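To make the deduplication idea concrete, here is a simplified, self-contained sketch of what dedup_tensors does: with replicated placements, several ranks propose writing the same shard, and only the first rank to claim a (tensor, shard) pair keeps it. The WriteItem and SavePlan classes below are illustrative stand-ins, not the actual torch.distributed.checkpoint types, and the field names are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import List

# Illustrative stand-ins for torch.distributed.checkpoint's WriteItem and
# SavePlan; the real classes carry more metadata than shown here.
@dataclass(frozen=True)
class WriteItem:
    fqn: str          # fully qualified tensor name
    shard_index: int  # which shard of the tensor this item writes

@dataclass
class SavePlan:
    items: List[WriteItem]

def dedup_tensors(all_plans: List[SavePlan]) -> List[SavePlan]:
    """Keep each (fqn, shard) write on only one rank.

    With replicated placement, multiple ranks propose writing the same
    shard; the first plan that claims it keeps it, later plans drop it.
    """
    seen = set()
    deduped = []
    for plan in all_plans:
        kept = []
        for item in plan.items:
            key = (item.fqn, item.shard_index)
            if key not in seen:
                seen.add(key)
                kept.append(item)
        deduped.append(SavePlan(items=kept))
    return deduped
```

For example, if rank 0 and rank 1 both propose writing shard 0 of tensor "w", rank 0's plan keeps the write and rank 1's plan drops it, so the shard is written to storage exactly once.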

@pytorch-bot

pytorch-bot bot commented Nov 21, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89399

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 74245e4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wz337 wz337 changed the title add dedup tensors [PT-D][Checkpoint] Add dedup_tensors for distributed checkpoint Nov 21, 2022
@wz337 wz337 changed the title [PT-D][Checkpoint] Add dedup_tensors for distributed checkpoint [Checkpoint][2D][1/N] Add dedup_tensors for distributed checkpoint Nov 21, 2022
@wz337 wz337 requested a review from wanchaol November 21, 2022 15:56
@wz337 wz337 marked this pull request as ready for review November 21, 2022 15:56
@wz337 wz337 changed the title [Checkpoint][2D][1/N] Add dedup_tensors for distributed checkpoint [Checkpoint][2D][1/N] Move dedup_tensors for distributed checkpoint Nov 21, 2022
@wz337 wz337 changed the title [Checkpoint][2D][1/N] Move dedup_tensors for distributed checkpoint [Checkpoint][2D][1/N] Move dedup_tensors for distributed checkpoint to core distributed Nov 21, 2022
@wz337 wz337 changed the title [Checkpoint][2D][1/N] Move dedup_tensors for distributed checkpoint to core distributed [Checkpoint][2D][1/N] Add dedup_tensors for distributed checkpoint to core distributed Nov 21, 2022
Collaborator

@wanchaol wanchaol left a comment


lgtm, please add some docstrings in follow-up PRs.

from torch.distributed.checkpoint.dedup_tensors import dedup_tensors
from torch.distributed.checkpoint.planner import SavePlan, WriteItemType
from torch.distributed.checkpoint.planner_helpers import (
    _create_write_item_for_tensor,
)
Collaborator


nit: is there a standard in the checkpointing code for how public and private APIs are exposed? i.e., for planner_helpers, could we make this a public API, since it's being used outside of planner_helpers itself?

Contributor Author


I am actually not 100% sure of the distinction between public and private APIs for checkpoint. I am following what Rodrigo did previously here. From my observation, it seems everything used by a class is public, and every helper function under a public function is private.

Contributor Author

wz337 commented Nov 22, 2022

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 22, 2022
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

wz337 added a commit to pytorch/PiPPy that referenced this pull request Nov 23, 2022
…ctly (#638)

This updates dt_planner to use dedup_tensors API from PyTorch directly,
as it has been added recently in this PR,
pytorch/pytorch#89399.
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
… core distributed (pytorch#89399)

Pull Request resolved: pytorch#89399
Approved by: https://github.com/wanchaol

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged


3 participants