
Conversation

Contributor

@wz337 wz337 commented Nov 21, 2022

This PR moves dedup_tensors and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

This removes duplicated shards in a list of SavePlan objects. It is used when saving a DTensor with replicated placement.

Docstring and comments will be added in the following PRs.
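To make the deduplication idea concrete, here is a simplified, self-contained sketch of what dedup_tensors does: with replicated placements, several ranks propose writing the same shard, and only the first rank to claim a (tensor, shard) pair keeps it. The WriteItem and SavePlan classes below are illustrative stand-ins, not the actual torch.distributed.checkpoint types, and the field names are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import List

# Illustrative stand-ins for torch.distributed.checkpoint's WriteItem and
# SavePlan; the real classes carry more metadata than shown here.
@dataclass(frozen=True)
class WriteItem:
    fqn: str          # fully qualified tensor name
    shard_index: int  # which shard of the tensor this item writes

@dataclass
class SavePlan:
    items: List[WriteItem]

def dedup_tensors(all_plans: List[SavePlan]) -> List[SavePlan]:
    """Keep each (fqn, shard) write on only one rank.

    With replicated placement, multiple ranks propose writing the same
    shard; the first plan that claims it keeps it, later plans drop it.
    """
    seen = set()
    deduped = []
    for plan in all_plans:
        kept = []
        for item in plan.items:
            key = (item.fqn, item.shard_index)
            if key not in seen:
                seen.add(key)
                kept.append(item)
        deduped.append(SavePlan(items=kept))
    return deduped
```

For example, if rank 0 and rank 1 both propose writing shard 0 of tensor "w", rank 0's plan keeps the write and rank 1's plan drops it, so the shard is written to storage exactly once.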

@pytorch-bot

pytorch-bot bot commented Nov 21, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89399

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 74245e4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wz337 wz337 changed the title add dedup tensors [PT-D][Checkpoint] Add dedup_tensors for distributed checkpoint Nov 21, 2022
@wz337 wz337 changed the title [PT-D][Checkpoint] Add dedup_tensors for distributed checkpoint [Checkpoint][2D][1/N] Add dedup_tensors for distributed checkpoint Nov 21, 2022
@wz337 wz337 requested a review from wanchaol November 21, 2022 15:56
@wz337 wz337 marked this pull request as ready for review November 21, 2022 15:56
@wz337 wz337 changed the title [Checkpoint][2D][1/N] Add dedup_tensors for distributed checkpoint [Checkpoint][2D][1/N] Move dedup_tensors for distributed checkpoint Nov 21, 2022
@wz337 wz337 changed the title [Checkpoint][2D][1/N] Move dedup_tensors for distributed checkpoint [Checkpoint][2D][1/N] Move dedup_tensors for distributed checkpoint to core distributed Nov 21, 2022
@wz337 wz337 changed the title [Checkpoint][2D][1/N] Move dedup_tensors for distributed checkpoint to core distributed [Checkpoint][2D][1/N] Add dedup_tensors for distributed checkpoint to core distributed Nov 21, 2022
Collaborator

@wanchaol wanchaol left a comment


lgtm, please add some docstrings in follow-up PRs.

from torch.distributed.checkpoint.dedup_tensors import dedup_tensors
from torch.distributed.checkpoint.planner import SavePlan, WriteItemType
from torch.distributed.checkpoint.planner_helpers import (
    _create_write_item_for_tensor,
)
Collaborator


nit: is there a standard in the checkpointing code for how public and private APIs are exposed? i.e., for planner_helpers, could we make this a public API, since it's being used outside of planner_helpers itself?

Contributor Author


I am actually not 100% sure of the distinction between public and private APIs for checkpoint. I am following what Rodrigo did previously here. From my observation, it seems everything used by a class is public, and every helper function under a public function is private.

Contributor Author

wz337 commented Nov 22, 2022

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 22, 2022
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

wz337 added a commit to pytorch/PiPPy that referenced this pull request Nov 23, 2022
…ctly (#638)

This updates dt_planner to use dedup_tensors API from PyTorch directly,
as it has been added recently in this PR,
pytorch/pytorch#89399.
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
… core distributed (pytorch#89399)

Pull Request resolved: pytorch#89399
Approved by: https://github.com/wanchaol

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged


3 participants