Conversation

@wz337 (Contributor) commented Nov 22, 2022

This PR moves nested_tensors to torch.distributed.checkpoint. This is a prerequisite for enabling 2D checkpointing.

It flattens sharded tensors in the state_dict, and is used when saving and loading an FSDP SHARDED_STATE_DICT.

Docstrings and individual and integration tests will be added in follow-up PRs.
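
For context, here is a minimal sketch (not part of this PR) of the save path these helpers support: writing an FSDP model's SHARDED_STATE_DICT through torch.distributed.checkpoint. It assumes a process group is already initialized and that model is an FSDP-wrapped module.

import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType

# With SHARDED_STATE_DICT, each rank's state_dict holds ShardedTensors;
# the flattening helpers in this PR prepare them for the checkpoint planner.
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = model.state_dict()

dist_cp.save_state_dict(
    state_dict=state_dict,
    storage_writer=dist_cp.FileSystemWriter("checkpoint_dir"),
)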

@pytorch-bot (bot) commented Nov 22, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89501

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9f6be82:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wz337 wz337 marked this pull request as ready for review November 22, 2022 16:34
@wz337 wz337 requested a review from wanchaol November 22, 2022 16:34
@wz337 wz337 changed the title [Checkpoint][2D][1/N] Add nested_tensors for distributed checkpoint to core distributed [Checkpoint][2D][3/N] Add nested_tensors for distributed checkpoint to core distributed Nov 22, 2022
@wanchaol (Collaborator) left a comment:

Some questions about why the placement is always cuda:0. Stamping to unblock the migration.

Collaborator: This doesn't seem to be used anywhere?

@wz337 (Contributor, Author): It will be used in optimizer.py, which will be upstreamed in a follow-up PR.

Collaborator: Why is this cuda:0 only?

@wz337 (Contributor, Author): I believe this is a placeholder, because calling _init_from_local_shards_and_global_metadata() on line 110 requires it as an input arg.
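
For illustration, a hedged sketch (hypothetical values, not the PR's code) of why a placement must be supplied: ShardMetadata cannot be constructed without one, so code that rebuilds a ShardedTensor passes a placeholder device even if the actual placement is corrected later on load.

from torch.distributed._shard.sharding_spec import ShardMetadata

# ShardMetadata requires a placement string, so reconstruction code must
# supply a device such as cuda:0 even when it is only a placeholder.
meta = ShardMetadata(
    shard_offsets=[0, 0],
    shard_sizes=[4, 4],
    placement="rank:0/cuda:0",  # placeholder device
)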

Collaborator: Is there a test for this function?

@wz337 (Contributor, Author): There will be an integration test once the planner is updated with 2D functionality. There is no individual test for this yet; I will add one as a test improvement once everything is moved over.

@wz337 (Contributor, Author) commented Nov 28, 2022

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 28, 2022
@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):
Merge failed

Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR:
@pytorchbot rebase

Details for Dev Infra team: raised by workflow job.

@wz337 (Contributor, Author) commented Nov 28, 2022

@pytorchmergebot rebase

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator):
Successfully rebased add_nested_tensors onto refs/remotes/origin/viable/strict; please pull locally before adding more changes (for example, via git checkout add_nested_tensors && git pull --rebase).

@wz337 (Contributor, Author) commented Nov 28, 2022

@pytorchmergebot merge

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

from .utils import _element_wise_add

# TODO: update docstring for nested_tensor.py
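
For context, a plausible reconstruction of the helper imported above (the PR's actual implementation may differ): it adds two offset lists element-wise, which is how a nested shard's offsets can be combined with its parent's.

from typing import List

def _element_wise_add(a: List[int], b: List[int]) -> List[int]:
    # Sum corresponding entries, e.g. to combine shard offsets.
    return [x + y for x, y in zip(a, b)]

assert _element_wise_add([2, 0], [1, 3]) == [3, 3]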
Contributor: Why are we creating another file with a name that matches something existing? Is there a plan to use the actual NestedTensor here eventually? cc @drisspg

Contributor: Agreed, this naming clash is confusing. For reference, here is the documentation on NestedTensor: https://pytorch.org/docs/stable/nested.html
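
For contrast, a minimal example of the core NestedTensor API the reviewers are referring to (torch.nested, a prototype feature at the time):

import torch

# A nested tensor packs tensors with differing leading dimensions into a
# single object; it is unrelated to this PR's nested_tensor.py helpers.
nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
print(nt.is_nested)  # True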

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
[Checkpoint][2D][3/N] Add nested_tensors for distributed checkpoint to core distributed (pytorch#89501)

Pull Request resolved: pytorch#89501
Approved by: https://github.com/wanchaol
pytorchmergebot pushed a commit that referenced this pull request Jan 23, 2023