[Checkpoint][2D][3/N] Add nested_tensors for distributed checkpoint to core distributed #89501
Conversation
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/89501. Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 9f6be82. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
wanchaol left a comment:
Some questions about why the placement is always on cuda:0; stamping to unblock the migration.
Review comment: This doesn't seem to be used at all?

Reply: It will be used in optimizer.py, which will be upstreamed in the following PR.
Review comment: Why is this cuda:0 only?

Reply: I believe this is a placeholder, because calling _init_from_local_shards_and_global_metadata() on line 110 requires it as an input arg.
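For context, here is a minimal illustrative sketch (not this PR's code) of why some placement value has to be supplied at all, assuming the private torch.distributed._shard.metadata.ShardMetadata API: every shard description carries a mandatory placement, so constructing a ShardedTensor from metadata needs a device string even if it is only a placeholder.

```python
# Illustrative sketch (not from this PR): ShardMetadata requires a placement,
# so metadata-driven construction needs a device string even as a placeholder.
from torch.distributed._shard.metadata import ShardMetadata

placeholder_shard_md = ShardMetadata(
    shard_offsets=[0, 0],        # where this shard starts in the global tensor
    shard_sizes=[4, 4],          # extent of the shard in each dimension
    placement="rank:0/cuda:0",   # placeholder device string, parsed on construction
)
print(placeholder_shard_md.placement)  # rank:0/cuda:0
```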
Review comment: Is there a test for this function?

Reply: There is an integration test once the planner is updated with 2D functionality; there is no individual test for this yet. A test will be added as a follow-up improvement once everything is moved over.
@pytorchmergebot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: this PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR:

@pytorchmergebot rebase

@pytorchbot successfully started a rebase job. Check the current status here.

Successfully rebased.

@pytorchmergebot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Diff excerpt under review:

from .utils import _element_wise_add
# TODO: update docstring for nested_tensor.py
Review comment: Why are we creating another file with a name that matches something that already exists? Is there a plan to use the actual NestedTensor here eventually? cc @drisspg

Reply: Agreed, this naming clash is confusing. For reference, this is the documentation on NestedTensor: https://pytorch.org/docs/stable/nested.html
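For illustration, a tiny example of the pre-existing torch.nested API that the file name clashes with (illustrative only; unrelated to the checkpoint helper added in this PR):

```python
# The existing NestedTensor API documented at
# https://pytorch.org/docs/stable/nested.html; unrelated to
# torch/distributed/checkpoint/nested_tensor.py added in this PR.
import torch

nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
print(nt.is_nested)  # True -- "nested tensor" already has a distinct meaning in core
```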
[Checkpoint][2D][3/N] Add nested_tensors for distributed checkpoint to core distributed (pytorch#89501). Pull Request resolved: pytorch#89501. Approved by: https://github.com/wanchaol
This PR moves nested_tensors to torch.distributed.checkpoint. This is a prerequisite for enabling 2D checkpoint.
It flattens sharded tensors in the state_dict and is used when saving and loading an FSDP SHARDED_STATE_DICT.
Docstrings, an individual test, and an integration test will be added in the following PRs.
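To make the "flattening" idea concrete, here is a conceptual, self-contained sketch of flattening a nested state_dict into dotted keys and restoring it. The helper names (flatten_state_dict, unflatten_state_dict) are hypothetical; this is not the PR's implementation, which lives in torch.distributed.checkpoint and additionally has to handle ShardedTensor values.

```python
# Conceptual sketch only (not the PR's implementation): flatten a nested
# state_dict into dotted paths and restore it, which is the general idea
# behind flattening containers for FSDP SHARDED_STATE_DICT handling.
from typing import Any, Dict


def flatten_state_dict(nested: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn {"layer": {"weight": t}} into {"layer.weight": t}."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_state_dict(value, path))
        else:
            flat[path] = value
    return flat


def unflatten_state_dict(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Inverse of flatten_state_dict."""
    nested: Dict[str, Any] = {}
    for path, value in flat.items():
        parts = path.split(".")
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


if __name__ == "__main__":
    sd = {"encoder": {"weight": [1, 2], "bias": [3]}, "step": 7}
    flat = flatten_state_dict(sd)
    assert unflatten_state_dict(flat) == sd
    print(sorted(flat))  # ['encoder.bias', 'encoder.weight', 'step']
```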