Conversation

@wz337 (Contributor) commented Nov 22, 2022

This PR moves nested_tensors to torch.distributed.checkpoint. This is a prerequisite for enabling 2D checkpointing.

It flattens sharded tensors in the state_dict, and is used when saving and loading an FSDP SHARDED_STATE_DICT.

Docstrings and individual and integration tests will be added in follow-up PRs.
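
For context, here is a minimal sketch (not part of this PR) of the save path these helpers support: writing an FSDP model's SHARDED_STATE_DICT through torch.distributed.checkpoint. It assumes a process group is already initialized and that model is an FSDP-wrapped module.

import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType

# With SHARDED_STATE_DICT, each rank's state_dict holds ShardedTensors;
# the flattening helpers in this PR prepare them for the checkpoint planner.
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = model.state_dict()

dist_cp.save_state_dict(
    state_dict=state_dict,
    storage_writer=dist_cp.FileSystemWriter("checkpoint_dir"),
)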

@pytorch-bot (bot) commented Nov 22, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89501

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9f6be82:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wz337 wz337 marked this pull request as ready for review November 22, 2022 16:34
@wz337 wz337 requested a review from wanchaol November 22, 2022 16:34
@wz337 wz337 changed the title [Checkpoint][2D][1/N] Add nested_tensors for distributed checkpoint to core distributed [Checkpoint][2D][3/N] Add nested_tensors for distributed checkpoint to core distributed Nov 22, 2022
@wanchaol (Collaborator) left a comment:

Some questions about why the placement is always cuda:0. Stamping to unblock the migration.

Collaborator: This doesn't seem to be used anywhere?

@wz337 (Contributor, Author): It will be used in optimizer.py, which will be upstreamed in a follow-up PR.

Collaborator: Why is this cuda:0 only?

@wz337 (Contributor, Author): I believe this is a placeholder, because calling _init_from_local_shards_and_global_metadata() on line 110 requires it as an input arg.
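
For illustration, a hedged sketch (hypothetical values, not the PR's code) of why a placement must be supplied: ShardMetadata cannot be constructed without one, so code that rebuilds a ShardedTensor passes a placeholder device even if the actual placement is corrected later on load.

from torch.distributed._shard.sharding_spec import ShardMetadata

# ShardMetadata requires a placement string, so reconstruction code must
# supply a device such as cuda:0 even when it is only a placeholder.
meta = ShardMetadata(
    shard_offsets=[0, 0],
    shard_sizes=[4, 4],
    placement="rank:0/cuda:0",  # placeholder device
)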

Collaborator: Is there a test for this function?

@wz337 (Contributor, Author): There will be an integration test once the planner is updated with 2D functionality. There is no individual test for this yet; I will add one as a test improvement once everything is moved over.

@wz337 (Contributor, Author) commented Nov 28, 2022

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 28, 2022
@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):
Merge failed

Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR:
@pytorchbot rebase

Details for Dev Infra team: raised by workflow job.

@wz337 (Contributor, Author) commented Nov 28, 2022

@pytorchmergebot rebase

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator):
Successfully rebased add_nested_tensors onto refs/remotes/origin/viable/strict; please pull locally before adding more changes (for example, via git checkout add_nested_tensors && git pull --rebase).

@wz337 (Contributor, Author) commented Nov 28, 2022

@pytorchmergebot merge

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

from .utils import _element_wise_add

# TODO: update docstring for nested_tensor.py
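
For context, a plausible reconstruction of the helper imported above (the PR's actual implementation may differ): it adds two offset lists element-wise, which is how a nested shard's offsets can be combined with its parent's.

from typing import List

def _element_wise_add(a: List[int], b: List[int]) -> List[int]:
    # Sum corresponding entries, e.g. to combine shard offsets.
    return [x + y for x, y in zip(a, b)]

assert _element_wise_add([2, 0], [1, 3]) == [3, 3]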
Contributor: Why are we creating another file with a name that matches something existing? Is there a plan to use the actual NestedTensor here eventually? cc @drisspg

Contributor: Agreed, this naming clash is confusing. For reference, here is the documentation on NestedTensor: https://pytorch.org/docs/stable/nested.html
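
For contrast, a minimal example of the core NestedTensor API the reviewers are referring to (torch.nested, a prototype feature at the time):

import torch

# A nested tensor packs tensors with differing leading dimensions into a
# single object; it is unrelated to this PR's nested_tensor.py helpers.
nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
print(nt.is_nested)  # True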

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
[Checkpoint][2D][3/N] Add nested_tensors for distributed checkpoint to core distributed (pytorch#89501)

Pull Request resolved: pytorch#89501
Approved by: https://github.com/wanchaol
pytorchmergebot pushed a commit that referenced this pull request Jan 23, 2023