
Conversation

t-vi (Collaborator) commented Jul 10, 2020

Previously we did not link against amdhip64 (roughly equivalent to cudart). Apparently, the recent RTLD_GLOBAL fixes prevent the extensions from finding the symbols needed for launching kernels.
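
For background, a minimal illustration of the dlopen semantics at play (not PyTorch code; assumes a ROCm install that ships libamdhip64.so):

    import ctypes

    # With RTLD_GLOBAL, the runtime's symbols become visible to every object
    # loaded afterwards, so extensions resolve kernel-launch symbols for free.
    runtime = ctypes.CDLL("libamdhip64.so", mode=ctypes.RTLD_GLOBAL)

    # With RTLD_LOCAL (the behaviour after the fixes), the symbols stay
    # private, so an extension must link libamdhip64 itself to launch kernels.
    runtime = ctypes.CDLL("libamdhip64.so", mode=ctypes.RTLD_LOCAL)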

t-vi (Collaborator, author) commented Jul 10, 2020

@jeffdaily

dr-ci bot commented Jul 10, 2020

💊 CI failures summary and remediations

As of commit 266d30e (more details on the Dr. CI page):


  • 1/2 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)
  • 1/2 broken upstream at merge base c86699d since Jul 15

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


ci.pytorch.org: 1 failed



jeffdaily (Collaborator) commented Jul 10, 2020

This library (libamdhip64.so) is new as of ROCm 3.5. Can we version-guard these changes to preserve some backward compatibility?
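
One way to obtain a version for such a guard (a sketch; the version-file path is an assumption about the ROCm install layout, not the actual patch):

    import os
    import re

    def rocm_version(rocm_home='/opt/rocm'):
        """Best-effort probe of the installed ROCm version as (major, minor)."""
        try:
            with open(os.path.join(rocm_home, '.info', 'version')) as f:
                match = re.match(r'(\d+)\.(\d+)', f.read())
            return (int(match.group(1)), int(match.group(2)))
        except (OSError, AttributeError):
            return None  # unknown; the caller picks a safe default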

@jeffdaily added the module: rocm (AMD GPU support for PyTorch) label Jul 10, 2020
t-vi (Collaborator, author) commented Jul 10, 2020

@jeffdaily Sure, what do we need to link for older ROCm to be able to launch kernels?

t-vi added 2 commits July 10, 2020 19:40
Previously we did not link against amdhip64 (roughly equivalent to
cudart). Apparently, the RTLD_GLOBAL fixes prevent the extensions
from finding the symbols needed for launching kernels.
@t-vi t-vi force-pushed the rocm_inline_custom_ops_kernels branch from 3afd2f3 to b18556c Compare July 10, 2020 17:40
@t-vi t-vi requested a review from jeffdaily July 10, 2020 17:41
z = module.cos_add(x, y)
self.assertEqual(z, x.cos() + y.cos())

@unittest.skipIf(not TEST_CUDA, "CUDA not found")
Review comment (Collaborator):

Should we add the not (TEST_CUDA or TEST_ROCM) guard in this PR, or do you want to create a follow-up PR to enable many of these TEST_CUDA-only tests?
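
The suggested guard, sketched in the context of the snippet above (TEST_ROCM follows the naming in this comment; the actual flag in torch.testing may be spelled differently):

    # Skip unless either a CUDA or a ROCm device is available.
    @unittest.skipIf(not (TEST_CUDA or TEST_ROCM), "CUDA/ROCm not found")
    def test_inline_jit_compile_custom_op_cuda(self):
        ...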

Review comment (Collaborator):

Also I should note that the test script where you've added this test is one of the "slow" ones that does not run automatically in CI.

Review comment (Collaborator, author):

Let's enable them all. Is it not run at all, or just not run on all instances?

Review comment (Collaborator, author):

Next, I'll try a non-stupid variant...

@t-vi t-vi force-pushed the rocm_inline_custom_ops_kernels branch from aff0ff8 to 0ba287f Compare July 11, 2020 12:31
@ailzhang ailzhang requested a review from jeffdaily July 13, 2020 15:59
@ailzhang added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Jul 13, 2020
ezyang (Contributor) commented Jul 14, 2020

Doesn't seem to be enough; the ROCm test is still failing.

t-vi (Collaborator, author) commented Jul 14, 2020

Yes, and I'm at a loss as to why it happens; it does not happen on my ROCm box.

jeffdaily (Collaborator) commented:

It doesn't happen on my local box, either. However, the missing symbol that CI is complaining about is in libhip_hcc.so. Should you try adding that for ROCm <= 3.3?
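
In sketch form, the guarded link line the thread converges on (library names from the discussion; treating 3.5 as the cut-over is an assumption). This pairs with a version probe like the rocm_version() sketch above:

    def hip_runtime_link_flags(rocm_version):
        # libamdhip64.so is new as of ROCm 3.5; on older releases the
        # kernel-launch symbols live in libhip_hcc.so, per the CI failure.
        if rocm_version is not None and rocm_version >= (3, 5):
            return ['-lamdhip64']
        return ['-lhip_hcc']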

t-vi (Collaborator, author) commented Jul 14, 2020

Ha, I'll try that. Thank you!

t-vi (Collaborator, author) commented Jul 15, 2020

Now there is a bad interaction with a new patch, as we hit this line:

if int(torch.version.cuda.split('.')[0]) < 11:

I'll send a fix.
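
For context, a hedged sketch of the guard such a fix needs: on ROCm builds torch.version.cuda is None (torch.version.hip is set instead), so the unconditional split crashes.

    import torch

    cuda_version = torch.version.cuda  # None on ROCm/HIP builds of PyTorch
    if cuda_version is not None and int(cuda_version.split('.')[0]) < 11:
        pass  # the CUDA-version-specific branch goes here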

t-vi (Collaborator, author) commented Jul 15, 2020

@ezyang @jeffdaily I'd claim that it's working now and that the Windows test failure comes from master.

jeffdaily (Collaborator) left a review:

LGTM. Does test_cpp_extensions_jit.py run during CI?

t-vi (Collaborator, author) commented Jul 15, 2020

@jeffdaily

Does test_cpp_extensions_jit.py run during CI?

Yes. In test2 on pytorch-linux-xenial-rocm3.5.1-py3.6 (plain-text log, timestamps are system time), there is:
12:40:11 test_inline_jit_compile_custom_op_cuda (__main__.TestCppExtensionJIT)
in this log:
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-xenial-rocm3.5.1-py3.6-test2/201/timestamps/?time=HH:mm:ss&appendLog&locale=en_US

jeffdaily (Collaborator) commented Jul 15, 2020

PR #40800 added the torch.version.cuda check that you had to fix when you rebased. It was reverted because that change broke ROCm. Your fix was probably sufficient, but since the entire PR was reverted, you now need to rebase this PR again.

t-vi (Collaborator, author) commented Jul 15, 2020

No, I'll stop here, really.

t-vi (Collaborator, author) commented Jul 16, 2020

@pytorchbot merge this please

@pytorchbot added the merge-this-please (Was marked for merge with @pytorchbot merge this please) label Jul 16, 2020
t-vi (Collaborator, author) commented Jul 16, 2020

I'd be keen to get this in before it breaks again... 😉

facebook-github-bot (Contributor) left a comment:

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

t-vi (Collaborator, author) commented Jul 17, 2020

@ezyang If you have a hint as to what I need to check from the failed FB-internal test...?

facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in 0f78e59.


Labels

merge-this-please (Was marked for merge with @pytorchbot merge this please)
Merged
module: rocm (AMD GPU support for PyTorch)
open source
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
