[Checkpoint][2D][1/N] Add dedup_tensors for distributed checkpoint to core distributed #89399
Conversation
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89399
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 74245e4. This comment was automatically generated by Dr. CI and updates every 15 minutes.
wanchaol left a comment
lgtm, please add some docstrings in follow-up PRs.
from torch.distributed.checkpoint.dedup_tensors import dedup_tensors
from torch.distributed.checkpoint.planner import SavePlan, WriteItemType
from torch.distributed.checkpoint.planner_helpers import (
    _create_write_item_for_tensor,
nit: is there a standard in the checkpointing code for where public and private APIs are exposed? i.e., could planner_helpers also be made a public API, since it is used outside of planner_helpers itself?
I am actually not 100% sure of the distinction between public and private APIs for checkpoint. I am following what Rodrigo has done previously here. From my observation, it seems everything used by a class is public, and every helper function under a public function is private.
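For context on what the test imports above exercise, here is a minimal sketch of the dedup behavior, assuming the import paths shown in the diff; the two-plan setup and the FQN are illustrative and not taken from the PR's actual test.

```python
import torch

from torch.distributed.checkpoint.dedup_tensors import dedup_tensors
from torch.distributed.checkpoint.planner import SavePlan
from torch.distributed.checkpoint.planner_helpers import _create_write_item_for_tensor

# Two ranks that hold the same replicated tensor each propose a write item
# for the same FQN, so both SavePlans carry a duplicate entry.
# "replicated.weight" is an illustrative name, not from the PR.
replicated = torch.zeros(4, 4)
write_item = _create_write_item_for_tensor("replicated.weight", replicated)

plan_rank0 = SavePlan(items=[write_item])
plan_rank1 = SavePlan(items=[write_item])

# dedup_tensors keeps the duplicated entry in only one of the plans,
# so the shard is written to storage exactly once.
deduped = dedup_tensors([plan_rank0, plan_rank1])
assert sum(len(plan.items) for plan in deduped) == 1
```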
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ctly (#638): This updates dt_planner to use the dedup_tensors API from PyTorch directly, as it was recently added in this PR, pytorch/pytorch#89399.
… core distributed (pytorch#89399)
This PR moves dedup_tensors and its test to torch.distributed.checkpoint. This is a prerequisite for enabling 2D checkpoint. It removes duplicated shards in a list of SavePlan and is used when saving a DT with replicated placement. Docstrings and comments will be added in the following PRs.
Pull Request resolved: pytorch#89399
Approved by: https://github.com/wanchaol
This PR moves dedup_tensors and its test to torch.distributed.checkpoint. This is a prerequisite for enabling 2D checkpoint.
dedup_tensors removes duplicated shards in a list of SavePlan. It is used when saving a DT with replicated placement.
Docstrings and comments will be added in the following PRs.
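As a rough sketch of where this fits in the save flow (the dt_planner commit above wires it in on that side), a planner can run the dedup step over the gathered per-rank plans before building the global plan. The subclass below is hypothetical: the DedupAwareSavePlanner name is made up, and the DefaultSavePlanner base class and create_global_plan hook are assumed from the torch.distributed.checkpoint API rather than taken from this PR.

```python
from typing import List, Tuple

from torch.distributed.checkpoint.dedup_tensors import dedup_tensors
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner
from torch.distributed.checkpoint.metadata import Metadata
from torch.distributed.checkpoint.planner import SavePlan


class DedupAwareSavePlanner(DefaultSavePlanner):
    """Hypothetical planner that drops replicated shards before building
    the global plan, so each replicated tensor is written only once."""

    def create_global_plan(
        self, all_plans: List[SavePlan]
    ) -> Tuple[List[SavePlan], Metadata]:
        # Remove write items that appear in more than one rank's plan,
        # then defer to the default global-planning logic.
        all_plans = dedup_tensors(all_plans)
        return super().create_global_plan(all_plans)
```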