@wconstab wconstab commented Nov 15, 2022

Stack from ghstack (oldest at bottom):

pytorch-bot bot commented Nov 15, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89096

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Failures

As of commit 55d73a6:

The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Nov 15, 2022
ghstack-source-id: 558da2a
Pull Request resolved: #89096
@wconstab wconstab requested a review from msaroufim November 15, 2022 22:54
.. code::

    ddp_model = DDP(model, device_ids=[rank])
    ddp_model = torch.compile(ddp_model)

Member

So we can't merge this quite yet, since torch.compile won't exist until sometime next week. In the meantime, you can use the optimize API if you'd rather merge this now.

Contributor Author

no rush, i can just land it once it's ready. keep me posted.

------------------------

DDP's performance advantage comes from overlapping allreduce collectives with computations during backwards.
AotAutograd prevents this overlap when used with TorchDynamo for compiling a whole forward and whole backward graph,
Member

I would imagine the DDP audience may not know what AotAutograd is; I'd rather expand on this a bit more.

Maybe a picture would help make things clearer

Contributor Author

Do you think it's better to duplicate the picture/explanation here, or would a link out to @davidberard98's blog suffice? He explains it well and has pictures

Member

Link is fine

Contributor Author

ok. link is a paragraph below, but i could move it up if you think it would help.
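
For readers following this thread without the linked blog, the overlap claim in the diff text can be illustrated with a toy timing model. This is a sketch only, not PyTorch code: the function names and the per-bucket times are hypothetical, and it assumes one compute stream, one communication stream, and that a bucket's allreduce can start only once that bucket's gradients exist. The "without overlap" case models every allreduce being launched only after the whole backward graph has finished.

```python
def backward_time_with_overlap(compute_ms, comm_ms):
    """Finish time when each bucket's allreduce starts as soon as its gradients are ready."""
    compute_done = 0.0  # time at which the compute stream finishes the current bucket
    comm_done = 0.0     # time at which the communication stream becomes free
    for compute, comm in zip(compute_ms, comm_ms):
        compute_done += compute
        # An allreduce waits for its gradients and for the previous allreduce.
        comm_done = max(comm_done, compute_done) + comm
    return comm_done


def backward_time_without_overlap(compute_ms, comm_ms):
    """Finish time when all allreduces run only after the entire backward pass."""
    return sum(compute_ms) + sum(comm_ms)


# Hypothetical per-bucket costs in milliseconds (illustrative numbers only).
compute_ms = [4.0, 4.0, 4.0]
comm_ms = [3.0, 3.0, 3.0]

overlapped = backward_time_with_overlap(compute_ms, comm_ms)
serialized = backward_time_without_overlap(compute_ms, comm_ms)
print(f"with overlap: {overlapped} ms, without overlap: {serialized} ms")
```

With these made-up numbers the overlapped schedule finishes at 15 ms and the serialized one at 21 ms; only the last bucket's allreduce is left exposed when communication is hidden behind compute.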

wconstab added a commit that referenced this pull request Nov 16, 2022
ghstack-source-id: b4d0a1a
Pull Request resolved: #89096
wconstab added a commit that referenced this pull request Nov 22, 2022
ghstack-source-id: a847973
Pull Request resolved: #89096
wconstab added a commit that referenced this pull request Nov 29, 2022
ghstack-source-id: a8e4fde
Pull Request resolved: #89096
wconstab added a commit that referenced this pull request Nov 29, 2022
ghstack-source-id: 464ccb5
Pull Request resolved: #89096
wconstab added a commit that referenced this pull request Nov 29, 2022
ghstack-source-id: 43a8e57
Pull Request resolved: #89096
wconstab added a commit that referenced this pull request Nov 29, 2022
ghstack-source-id: 88cc651
Pull Request resolved: #89096
@wconstab wconstab added the release notes: distributed (ddp) release notes category label Nov 29, 2022
@wconstab
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 29, 2022
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: 4 additional jobs have failed, first few of them are: windows-binary-libtorch-debug, windows-binary-libtorch-debug / libtorch-cpu-shared-with-deps-debug-test, trunk, trunk / win-vs2019-cuda11.6-py3 / test (force_on_cpu, 1, 1, windows.4xlarge)

Details for Dev Infra team Raised by workflow job

@wconstab
Contributor Author

@pytorchbot merge -f "unrelated CI fail"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
@facebook-github-bot facebook-github-bot deleted the gh/wconstab/38/head branch June 8, 2023 19:17