Synchronize MAGMA functions with the current CUDA stream #36605
Conversation
💊 Build failures summary and remediations
As of commit 428bdde (more details on the Dr. CI page):
🕵️ 2 new failures recognized by patterns
The following build failures do not appear to be due to upstream breakages:
ngimel left a comment:
There probably wasn't any reason why the `_batched` routines with a `magma_queue_t` argument weren't used. Most likely, when these bindings were written magma did not yet support the queue argument. cc @vishwakftw in case he knows.
If the error returned by the magma kernels is non-sticky, it won't be checked (because `cudaStreamSynchronize` will likely still succeed). Also, can you use `getDefaultCUDAStream` instead of `0`, here and above?
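For illustration, a minimal sketch of the suggested change, assuming the synchronization in question is a plain `cudaStreamSynchronize` call (the helper name here is hypothetical, not from the PR):

```cpp
#include <cuda_runtime.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/Exceptions.h>

// Hypothetical helper: refer to the device's default stream through the ATen
// accessor instead of passing the raw legacy-stream handle 0.
void syncDefaultStreamSketch() {
  // getDefaultCUDAStream() returns a CUDAStream that converts to cudaStream_t.
  AT_CUDA_CHECK(cudaStreamSynchronize(at::cuda::getDefaultCUDAStream()));
}
```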
This is functionally correct, and in most cases magma performance is pretty bad anyway, so we probably shouldn't micro-optimize. But if pytorch is currently on the default stream, you don't need a synchronization here, right?
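A rough sketch of that micro-optimization, under the assumption that MAGMA's work goes onto the default stream (the helper name is made up for illustration):

```cpp
#include <cuda_runtime.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/Exceptions.h>

// Hypothetical helper: only wait for the current stream when it differs from
// the default stream. If PyTorch is already on the default stream, MAGMA's
// work is ordered after ours and no explicit synchronization is needed.
void maybeSyncCurrentStreamSketch() {
  auto current = at::cuda::getCurrentCUDAStream();
  if (current != at::cuda::getDefaultCUDAStream()) {
    AT_CUDA_CHECK(cudaStreamSynchronize(current));
  }
}
```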
This needs to complete before the host memory is freed immediately afterwards.
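To illustrate the point (hypothetical buffers and helper name; not code from this PR): the synchronization guarantees the in-flight device-to-host transfer has finished before the host buffer is released.

```cpp
#include <cuda_runtime.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/Exceptions.h>

// Hypothetical example: copy a result back to pinned host memory, then free it.
void copyOutAndFreeSketch(void* host_buf, const void* device_buf, size_t nbytes) {
  auto stream = at::cuda::getCurrentCUDAStream();
  AT_CUDA_CHECK(cudaMemcpyAsync(host_buf, device_buf, nbytes,
                                cudaMemcpyDeviceToHost, stream));
  // Without this wait, cudaFreeHost below could run while the copy is still
  // in flight on the GPU.
  AT_CUDA_CHECK(cudaStreamSynchronize(stream));
  AT_CUDA_CHECK(cudaFreeHost(host_buf));
}
```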
See its use on line 296.
I'm going to let @ngimel field this one; let me know if there is anything specific I should look at.
Okay, I've tried out substituting the single-matrix calls with batched calls, and the results are not good. Many tests now fail because of illegal memory accesses (#26996). That issue suggests fixing this would require batch sizes to be rounded up, which would mean doing a lot of unnecessary work. I've also had a look at the magma source code, and it strongly warns against using the batched routines this way. So, I don't think it makes sense to use the batched versions here.
@peterbell10 just FYI, the latest magma source is available on bitbucket (icl/magma) |
Ok, let's not use the batched versions. Recently we had another issue with the batched versions being buggy for large matrices. Can you please rebase so that the Windows tests have a chance to run? Once CI is green we can merge.
ngimel left a comment:
Thanks!
Force-pushed from 27934ac to 428bdde.
Rebased on
facebook-github-bot left a comment:
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thank you!
facebook-github-bot left a comment:
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Fixes #21821
This follows @ngimel's suggestion to manually synchronize MAGMA calls with the current stream. This is handled automatically with `MagmaStreamSyncGuard`.
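As a rough sketch of the idea behind such a guard (not necessarily the exact code in this PR, and renamed here to make that clear), assuming MAGMA enqueues its work on the default CUDA stream: the constructor waits for PyTorch's current stream so MAGMA sees completed inputs, and the destructor waits for the default stream so later work on the current stream sees MAGMA's results.

```cpp
#include <cuda_runtime.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/Exceptions.h>

struct MagmaStreamSyncGuardSketch {
  MagmaStreamSyncGuardSketch() {
    // Make sure inputs produced on the current stream are ready before MAGMA
    // launches work on the default stream.
    auto current = at::cuda::getCurrentCUDAStream();
    if (current != at::cuda::getDefaultCUDAStream()) {
      AT_CUDA_CHECK(cudaStreamSynchronize(current));
    }
  }

  // noexcept(false) so a failed synchronization can still surface as an error.
  ~MagmaStreamSyncGuardSketch() noexcept(false) {
    // Make sure MAGMA's work on the default stream has finished before the
    // current stream goes on to use the results.
    auto default_stream = at::cuda::getDefaultCUDAStream();
    if (at::cuda::getCurrentCUDAStream() != default_stream) {
      AT_CUDA_CHECK(cudaStreamSynchronize(default_stream));
    }
  }
};
```

Used as a scope guard around each MAGMA call, this keeps the synchronization logic in one place instead of repeating it at every call site.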
I think for the functions with `_batched` variants we could possibly avoid synchronisation by using a batch of size 1, since these have a `magma_queue_t` argument. However, I presume there's a reason it wasn't written like that in the first place.

I also figured out why porting to aten "magically fixed" `torch.svd`. The magma functions for svd all take host arrays as input and output. The ATen port uses blocking `copy_`s which fully synchronize the operation. On the other hand, the THC functions use `cudaMemcpy`, which doesn't synchronize with streams created with `cudaStreamNonBlocking` (which aten does). The fix is to use `cudaMemcpyAsync` and `cudaStreamSynchronize`, the same as `copy_` does internally:

pytorch/aten/src/ATen/native/cuda/Copy.cu, lines 192 to 193 in 835ee34
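The referenced lines implement roughly the following pattern (a sketch for illustration; the helper name is hypothetical): an async copy enqueued on the current stream followed by an explicit synchronization of that stream, which is what the MAGMA host/device staging copies are changed to match.

```cpp
#include <cuda_runtime.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/Exceptions.h>

// Hypothetical helper mirroring the blocking copy pattern: enqueue the copy
// on the current stream, then wait for that stream so the host data is valid
// when the function returns.
void blockingCopyToHostSketch(void* dst, const void* src, size_t nbytes) {
  auto stream = at::cuda::getCurrentCUDAStream();
  AT_CUDA_CHECK(cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream));
  AT_CUDA_CHECK(cudaStreamSynchronize(stream));
}
```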
I'm not sure how to test these changes, as I wasn't able to reproduce any of the stream sync issues. That's possibly a mixture of non-determinism and the fact that some of these functions are implicitly synchronous anyway.