Add bfloat16 support for nccl path #38515
💊 CI failures summary and remediations
As of commit 08cd7ca (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚
This comment was automatically generated by Dr. CI and has been revised 29 times.
48f2ba0 to e56ed43
test/distributed/test_nccl.py
Outdated
Since uniform_() is not defined for CPU bfloat16 tensors, we call uniform_() on a float32 tensor and then convert the result to the required dtype.
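For reference, the workaround described above can be sketched roughly as follows. This is only an illustration, not the code from the PR: the helper name `make_uniform_tensor` and the tensor size are hypothetical, and it assumes the in-place `uniform_()` sampler is available for CPU float32 tensors, as the comment states.

```python
import torch


def make_uniform_tensor(numel, dtype):
    """Hypothetical helper: build a CPU tensor of uniform random values in `dtype`."""
    # Sample in float32, where uniform_() is available on CPU.
    base = torch.empty(numel, dtype=torch.float32).uniform_()
    # Convert to the dtype under test (for example torch.bfloat16).
    return base.to(dtype=dtype)


x = make_uniform_tensor(128, torch.bfloat16)
print(x.dtype, tuple(x.shape))
```

Sampling in float32 and casting afterwards keeps the test input generation the same for every dtype, at the cost of one extra conversion. Note that a float32 value just below 1.0 may round up to 1.0 once it is cast to bfloat16.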
Please note that ROCm CI builds seem to be down temporarily; however, a code review would still be helpful.
@pytorchbot retest this please
…_nccl tests (#39354)
Summary: All individual test_nccl unit tests have been disabled for ROCm in bf93954. test_nccl was also added to the ROCM_BLACKLIST in 87b198d. However, the issue only arises when running the test_nccl suite as a whole (as opposed to any one test individually). More details are in the comments here: #38689
This PR enables the test_nccl suite with only two tests, to work around the as-yet-unresolved issue above while allowing at least one test_nccl collective test to run on ROCm. This is also needed as a precursor for: #38515
Pull Request resolved: #39354
Differential Revision: D21843194
Pulled By: mrshenli
fbshipit-source-id: b28d1e073d8d0fdc1b59928fc3b00187cfd02a35
53e0686 to b02cb4b
@mrshenli Kindly review.
Ping.
xw285cornell
left a comment
LGTM overall. Are we on the 3.5 docker image already? If not, should we wait for 3.5 to land before we merge this PR? Otherwise the tests are not actually testing this PR.
facebook-github-bot
left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
No description provided.