fegin commented Jan 30, 2023

Stack from ghstack (oldest at bottom):

`torchrec.DistributedModelParallel` overrides `named_parameters`, which makes it incompatible with `FullyShardedDataParallel`'s `optim_state_dict`. This PR adds a workaround in `FullyShardedDataParallel` so that the two work together.

Differential Revision: D42764611

pytorch-bot bot commented Jan 30, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/93285

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8c9e355:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the "release notes: distributed (fsdp)" label Jan 30, 2023
fegin added a commit that referenced this pull request Jan 30, 2023
ghstack-source-id: 178786341
Pull Request resolved: #93285
…bile with DMP"

fegin added a commit that referenced this pull request Jan 31, 2023
Pull Request resolved: #93285

ghstack-source-id: 178905353

# overwrite the flat_parameters traversal result to only obtain
# the last one, which happens to be the correct one.
#
# TODO: Remove this hack once DMP + FSDP is no longer supported.

Does this mean the hack should be removed once we've landed the composable path?

If all existing use cases are migrated to the composable path, then yes: we should support only trech_shard + fully_shard, not DMP + FSDP.
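
For readers who have not seen the surrounding code, the "keep the last traversal result" idea from the comment under review can be pictured with a toy sketch. This is not the FSDP implementation, which is not shown in this thread. `FlatParam`, `build_param_to_fqns`, and the prefixes below are hypothetical names chosen for illustration, and the sketch assumes only that a module walk can reach the same flat parameter more than once.

```python
import warnings


class FlatParam:
    """Hypothetical stand-in for a flat parameter (not the real FSDP class)."""

    def __init__(self, fqns):
        # Local names of the original parameters packed into this object.
        self.fqns = fqns


def build_param_to_fqns(traversal):
    """Map each parameter object to its fully qualified names.

    `traversal` is an iterable of (prefix, param) pairs, as a recursive
    module walk might yield them. If the same parameter is visited more
    than once, the last visit overwrites the earlier result.
    """
    param_to_fqns = {}
    for prefix, param in traversal:
        global_fqns = [prefix + name for name in param.fqns]
        if param in param_to_fqns:
            warnings.warn(
                f"parameter visited again under prefix {prefix!r}; "
                "keeping the last result"
            )
        # Assigning unconditionally is what makes the last visit win.
        param_to_fqns[param] = global_fqns
    return param_to_fqns


flat = FlatParam(["weight", "bias"])
# Toy traversal: the same object is reached twice under different prefixes.
visits = [("wrapper.", flat), ("wrapper.inner.", flat)]
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = build_param_to_fqns(visits)
print(result[flat])  # ['wrapper.inner.weight', 'wrapper.inner.bias']
```

Whether the last visit is actually the right one depends on how the wrapper orders its traversal, which is why the comment calls this a hack and why it only makes sense while DMP + FSDP is supported.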

…bile with DMP"

fegin added a commit that referenced this pull request Feb 2, 2023
Pull Request resolved: #93285

ghstack-source-id: 179122189

fegin added the "ciflow/trunk" label Feb 2, 2023
fegin commented Feb 2, 2023

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot

The merge job was canceled. If you believe this is a mistake, you can re-trigger it through pytorch-bot.

fegin commented Feb 2, 2023

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

facebook-github-bot deleted the gh/fegin/65/head branch June 8, 2023 17:17

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (fsdp) (release notes category)
