
Conversation

t-vi (Collaborator) commented Jul 10, 2020

Previously we did not link against amdhip64 (roughly equivalent to cudart). Apparently, the recent RTLD_GLOBAL fixes prevent the extensions from finding the symbols needed for launching kernels.
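
For background, a minimal illustration of the dlopen semantics at play (not PyTorch code; assumes a ROCm install that ships libamdhip64.so):

    import ctypes

    # With RTLD_GLOBAL, the runtime's symbols become visible to every object
    # loaded afterwards, so extensions resolve kernel-launch symbols for free.
    runtime = ctypes.CDLL("libamdhip64.so", mode=ctypes.RTLD_GLOBAL)

    # With RTLD_LOCAL (the behaviour after the fixes), the symbols stay
    # private, so an extension must link libamdhip64 itself to launch kernels.
    runtime = ctypes.CDLL("libamdhip64.so", mode=ctypes.RTLD_LOCAL)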

t-vi (Collaborator, author) commented Jul 10, 2020

@jeffdaily

dr-ci bot commented Jul 10, 2020

💊 CI failures summary and remediations

As of commit 266d30e (more details on the Dr. CI page):


  • 1/2 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)
  • 1/2 broken upstream at merge base c86699d since Jul 15

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


ci.pytorch.org: 1 failed



jeffdaily (Collaborator) commented Jul 10, 2020

This library (libamdhip64.so) is new as of ROCm 3.5. Can we version-guard these changes to preserve some backward compatibility?
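
One way to obtain a version for such a guard (a sketch; the version-file path is an assumption about the ROCm install layout, not the actual patch):

    import os
    import re

    def rocm_version(rocm_home='/opt/rocm'):
        """Best-effort probe of the installed ROCm version as (major, minor)."""
        try:
            with open(os.path.join(rocm_home, '.info', 'version')) as f:
                match = re.match(r'(\d+)\.(\d+)', f.read())
            return (int(match.group(1)), int(match.group(2)))
        except (OSError, AttributeError):
            return None  # unknown; the caller picks a safe default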

@jeffdaily added the module: rocm (AMD GPU support for PyTorch) label Jul 10, 2020
t-vi (Collaborator, author) commented Jul 10, 2020

@jeffdaily Sure, what do we need to link for older ROCm to be able to launch kernels?

t-vi added 2 commits July 10, 2020 19:40
Previously we did not link against amdhip64 (roughly equivalent to
cudart). Apparently, the RTLD_GLOBAL fixes prevent the extensions
from finding the symbols needed for launching kernels.
@t-vi t-vi force-pushed the rocm_inline_custom_ops_kernels branch from 3afd2f3 to b18556c Compare July 10, 2020 17:40
@t-vi t-vi requested a review from jeffdaily July 10, 2020 17:41
z = module.cos_add(x, y)
self.assertEqual(z, x.cos() + y.cos())

@unittest.skipIf(not TEST_CUDA, "CUDA not found")
Review comment (Collaborator):

Should we add the not (TEST_CUDA or TEST_ROCM) guard in this PR, or do you want to create a follow-up PR to enable many of these TEST_CUDA-only tests?
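
The suggested guard, sketched in the context of the snippet above (TEST_ROCM follows the naming in this comment; the actual flag in torch.testing may be spelled differently):

    # Skip unless either a CUDA or a ROCm device is available.
    @unittest.skipIf(not (TEST_CUDA or TEST_ROCM), "CUDA/ROCm not found")
    def test_inline_jit_compile_custom_op_cuda(self):
        ...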

Review comment (Collaborator):

Also I should note that the test script where you've added this test is one of the "slow" ones that does not run automatically in CI.

Review comment (Collaborator, author):

Let's enable them all. Is it not run at all, or just not run on all instances?

Review comment (Collaborator, author):

Next, I'll try a non-stupid variant...

@t-vi t-vi force-pushed the rocm_inline_custom_ops_kernels branch from aff0ff8 to 0ba287f Compare July 11, 2020 12:31
@ailzhang ailzhang requested a review from jeffdaily July 13, 2020 15:59
@ailzhang added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Jul 13, 2020
ezyang (Contributor) commented Jul 14, 2020

Doesn't seem to be enough; the ROCm test is still failing.

t-vi (Collaborator, author) commented Jul 14, 2020

Yes, and I'm at a loss as to why it happens; it does not happen on my ROCm box.

jeffdaily (Collaborator) commented:

It doesn't happen on my local box, either. However, the missing symbol that CI is complaining about is in libhip_hcc.so. Should you try adding that for ROCm <= 3.3?
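
In sketch form, the guarded link line the thread converges on (library names from the discussion; treating 3.5 as the cut-over is an assumption). This pairs with a version probe like the rocm_version() sketch above:

    def hip_runtime_link_flags(rocm_version):
        # libamdhip64.so is new as of ROCm 3.5; on older releases the
        # kernel-launch symbols live in libhip_hcc.so, per the CI failure.
        if rocm_version is not None and rocm_version >= (3, 5):
            return ['-lamdhip64']
        return ['-lhip_hcc']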

t-vi (Collaborator, author) commented Jul 14, 2020

Ha, I'll try that. Thank you!

t-vi (Collaborator, author) commented Jul 15, 2020

Now there is a bad interaction with a new patch, as we hit this line:

if int(torch.version.cuda.split('.')[0]) < 11:

I'll send a fix.
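
For context, a hedged sketch of the guard such a fix needs: on ROCm builds torch.version.cuda is None (torch.version.hip is set instead), so the unconditional split crashes.

    import torch

    cuda_version = torch.version.cuda  # None on ROCm/HIP builds of PyTorch
    if cuda_version is not None and int(cuda_version.split('.')[0]) < 11:
        pass  # the CUDA-version-specific branch goes here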

t-vi (Collaborator, author) commented Jul 15, 2020

@ezyang @jeffdaily I'd claim that it's working now and that the Windows test failure comes from master.

jeffdaily (Collaborator) left a review:

LGTM. Does test_cpp_extensions_jit.py run during CI?

t-vi (Collaborator, author) commented Jul 15, 2020

@jeffdaily

Does test_cpp_extensions_jit.py run during CI?

Yes. In test2 on pytorch-linux-xenial-rocm3.5.1-py3.6 (plain-text log, timestamps are system time), there is:
12:40:11 test_inline_jit_compile_custom_op_cuda (__main__.TestCppExtensionJIT)
in this log:
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-xenial-rocm3.5.1-py3.6-test2/201/timestamps/?time=HH:mm:ss&appendLog&locale=en_US

jeffdaily (Collaborator) commented Jul 15, 2020

PR #40800 added the torch.version.cuda check that you had to fix when you rebased. It was reverted because that change broke ROCm. Your fix was probably sufficient, but since the entire PR was reverted, you now need to rebase this PR again.

t-vi (Collaborator, author) commented Jul 15, 2020

No, I'll stop here, really.

t-vi (Collaborator, author) commented Jul 16, 2020

@pytorchbot merge this please

@pytorchbot added the merge-this-please (Was marked for merge with @pytorchbot merge this please) label Jul 16, 2020
t-vi (Collaborator, author) commented Jul 16, 2020

I'd be keen to get this in before it breaks again... 😉

facebook-github-bot (Contributor) left a comment:

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

t-vi (Collaborator, author) commented Jul 17, 2020

@ezyang If you have a hint as to what I need to check from the failed FB-internal test...?

facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in 0f78e59.


Labels

merge-this-please (Was marked for merge with @pytorchbot merge this please)
Merged
module: rocm (AMD GPU support for PyTorch)
open source
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
