ROCm: Fix linking of custom ops in load_inline #41257
|
This library (libamdhip64.so) is new as of ROCm 3.5. Can we version guard these changes to preserve some backward compatibility? |
|
@jeffdaily Sure, what do we need to link for older ROCm to be able to launch kernels? |
Previously we did not link against amdhip64 (roughly equivalent to cudart). Apparently, the RTLD_GLOBAL fixes prevent the extensions from finding the symbols needed for launching kernels.
Force-pushed from 3afd2f3 to b18556c.
test/test_cpp_extensions_jit.py (outdated)
        z = module.cos_add(x, y)
        self.assertEqual(z, x.cos() + y.cos())

    @unittest.skipIf(not TEST_CUDA, "CUDA not found")
Should we add the not (TEST_CUDA or TEST_ROCM) in this PR, or do you want to create a follow-up PR to enable many of these TEST_CUDA-only tests?
Also I should note that the test script where you've added this test is one of the "slow" ones that does not run automatically in CI.
Let's enable them all. Does the script not run at all, or just not on all instances?
Next, I'll try a non-stupid variant...
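For reference, the skip condition suggested above could look roughly like the sketch below. This is a self-contained illustration, not the code from this PR: `TEST_CUDA` and `TEST_ROCM` are hard-coded stand-ins here (in the real test script they come from PyTorch's test utilities), and `TestGpuExtension`, `test_needs_gpu`, and `run_suite` are hypothetical names.

```python
import unittest

# Stand-in flags for illustration only; the real values are detected at
# import time by the test utilities, not hard-coded like this.
TEST_CUDA = False
TEST_ROCM = False


class TestGpuExtension(unittest.TestCase):
    # Skip only when neither a CUDA nor a ROCm build is available.
    @unittest.skipIf(not (TEST_CUDA or TEST_ROCM), "CUDA or ROCm not found")
    def test_needs_gpu(self):
        # Placeholder body; a real test would build and call the extension.
        self.assertTrue(TEST_CUDA or TEST_ROCM)


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGpuExtension)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

With both flags false the test is reported as skipped rather than failed, so the same decorator works on CPU-only, CUDA, and ROCm machines.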
Force-pushed from aff0ff8 to 0ba287f.
|
That doesn't seem to be enough; the ROCm test is still failing. |
|
Yes, I'm at a loss as to why it happens; it does not happen on my ROCm box. |
|
It doesn't happen on my local box, either. However, the missing symbol that CI is complaining about is in libhip_hcc.so. Should you try adding that for ROCm <= 3.3? |
|
Ha, I'll try that. Thank you! |
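A minimal sketch of the version guard being discussed, assuming the ROCm version is available as a `(major, minor)` tuple. The helper names `hip_runtime_lib` and `hip_link_flags` are hypothetical and this is not the code that landed in the PR; the only facts taken from the thread are that libamdhip64.so is new as of ROCm 3.5 and that older releases provide the symbols in libhip_hcc.so.

```python
def hip_runtime_lib(rocm_version):
    """Return the HIP runtime library name for a (major, minor) ROCm version.

    Hypothetical helper: libamdhip64.so exists from ROCm 3.5 onward, while
    older releases (such as 3.3) ship libhip_hcc.so instead.
    """
    if rocm_version >= (3, 5):
        return "amdhip64"
    return "hip_hcc"


def hip_link_flags(rocm_version):
    """Build the linker flag for the HIP runtime library (illustrative only)."""
    return ["-l" + hip_runtime_lib(rocm_version)]


if __name__ == "__main__":
    print(hip_link_flags((3, 5)))
    print(hip_link_flags((3, 3)))
```

Comparing tuples keeps the guard correct for two-digit minor versions, which a plain string comparison would get wrong.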
|
Now there is a bad interaction with a new patch, as we hit this line:
I'll send a fix. |
|
@ezyang @jeffdaily I'd claim that it's working now and that the Windows test failure comes from master. |
jeffdaily
left a comment
LGTM. Does test_cpp_extensions_jit.py run during CI?
Yes, in the test2 on pytorch-linux-xenial-rocm3.5.1-py3.6 it is around (in plain text time or at least system time), there is: |
|
This PR #40800 added the |
|
No, I'll stop here, really. |
|
@pytorchbot merge this please |
|
I'd be keen to get this in before it breaks again... 😉 |
facebook-github-bot
left a comment
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
|
@ezyang If you have a hint what I need to check from the failed FB internal test...? |