@jithunnair-amd
Collaborator

No description provided.

@dr-ci

dr-ci bot commented May 14, 2020

💊 CI failures summary and remediations

As of commit 08cd7ca (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



This comment has been revised 29 times.

Collaborator Author

Since uniform_() is not defined for CPU bfloat16 tensors, we call uniform_() on a float32 tensor and then convert the result to the required dtype.
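
For readers unfamiliar with the pattern, here is a minimal sketch of the workaround described in the comment above: sample in float32 with PyTorch's in-place `Tensor.uniform_()`, then cast to the target dtype. The helper name `uniform_in_dtype` and the size of 128 are assumptions made for illustration and are not taken from this PR's diff.

```python
import torch


def uniform_in_dtype(size, dtype):
    # Hypothetical helper (not from the PR): fill a float32 CPU tensor with
    # uniform samples in place, then cast the result to the requested dtype.
    # This avoids calling uniform_() directly on a CPU bfloat16 tensor.
    base = torch.zeros(size, dtype=torch.float32)
    return base.uniform_().to(dtype=dtype)


x = uniform_in_dtype(128, torch.bfloat16)
print(x.dtype, tuple(x.shape))
```

Note that the cast rounds each sample to bfloat16 precision, so a value drawn very close to 1.0 may round up to exactly 1.0.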

@jithunnair-amd marked this pull request as ready for review May 16, 2020 05:13
@jithunnair-amd
Collaborator Author

Please note that the ROCm CI builds seem to be down temporarily; a code review would still be helpful in the meantime.

@jeffdaily
Collaborator

@pytorchbot retest this please

@jithunnair-amd
Collaborator Author

@pytorchbot retest this please

@ailzhang added the "triaged" label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 21, 2020
facebook-github-bot pushed a commit that referenced this pull request Jun 5, 2020
…_nccl tests (#39354)

Summary:
All individual test_nccl unit tests have been disabled for ROCm in bf93954.
test_nccl was also added to the ROCM_BLACKLIST in 87b198d.
However, the issue only arises when running the test_nccl suite as a whole (as opposed to any one test individually). More details in comments here: #38689

This PR enables the test_nccl suite with only two tests to work around the as-yet unresolved issue above, while allowing at least one test_nccl collective test to run on ROCm. This is also needed as a precursor for: #38515
Pull Request resolved: #39354

Differential Revision: D21843194

Pulled By: mrshenli

fbshipit-source-id: b28d1e073d8d0fdc1b59928fc3b00187cfd02a35
@jithunnair-amd force-pushed the enable_nccl_for_bfloat16 branch from 53e0686 to b02cb4b June 5, 2020 22:51
@jithunnair-amd
Collaborator Author

@mrshenli Kindly review.

@jeffdaily added the "module: rocm" label (AMD GPU support for Pytorch) Jul 6, 2020
@jeffdaily
Collaborator

Ping.
@ezyang or @xw285cornell if you could help encourage a review. Thanks.

Contributor

@xw285cornell left a comment

LGTM overall. Are we on the 3.5 docker image already? If not, should we wait for 3.5 to land before we merge this PR? Otherwise the tests are not actually testing this PR.

Contributor

@facebook-github-bot left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ezyang merged this pull request in eea5357.

Labels

Merged; module: rocm (AMD GPU support for Pytorch); open source; triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

8 participants