[cuBLASLt][Memtracker] Allocate temporary cuBLASLt workspaces using tensors rather than going to the caching allocator directly #139442
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139442
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 104eb92 with merge base d0fd42e.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Aidyn-A left a comment:
Looks good to me. QQ: does `at::empty` have the nullptr check? If yes, this:
`TORCH_CHECK(workspace.data_ptr() != nullptr, "OOM trying to allocate workspace for cublaslt");`
would be redundant.
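For reference, a minimal sketch of the pattern under discussion (not the exact code from this PR): the workspace is allocated as a CUDA byte tensor via `at::empty`, followed by the `TORCH_CHECK` quoted above. If `at::empty` already throws on allocation failure, the check is indeed redundant. The function name and tensor options below are illustrative assumptions.

```cpp
// Sketch only: allocate a cuBLASLt workspace as a byte tensor so the
// allocation goes through (and is tracked alongside) regular tensor
// allocations in the caching allocator.
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

at::Tensor allocate_cublaslt_workspace(int64_t workspace_bytes) {
  at::Tensor workspace = at::empty(
      {workspace_bytes},
      at::TensorOptions().dtype(at::kByte).device(at::kCUDA));
  // The check discussed above; redundant if at::empty already throws on OOM.
  TORCH_CHECK(workspace.data_ptr() != nullptr,
              "OOM trying to allocate workspace for cublaslt");
  return workspace;
}
```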
Does this have performance implications? I.e., is the `at::empty` overhead going to make calling the cuBLAS APIs slower than it was before?
Finally did some microbenchmarking; the overhead seems to be in the range of 0.5 µs.
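For context, a rough microbenchmark sketch along these lines (assuming a CUDA build of libtorch; the workspace size, iteration count, and timing method here are arbitrary choices, not the benchmark that was actually run):

```cpp
// Rough sketch: time repeated at::empty calls for a workspace-sized byte
// tensor. After a warm-up allocation the caching allocator should serve the
// request from its cache, so this approximates the per-call overhead.
#include <ATen/ATen.h>
#include <chrono>
#include <iostream>

int main() {
  const int64_t workspace_bytes = 32 * 1024;  // arbitrary workspace size
  auto opts = at::TensorOptions().dtype(at::kByte).device(at::kCUDA);
  { auto warmup = at::empty({workspace_bytes}, opts); }  // prime the cache

  const int iters = 10000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    auto w = at::empty({workspace_bytes}, opts);  // allocated, then returned to the cache
  }
  auto end = std::chrono::steady_clock::now();
  double us = std::chrono::duration<double, std::micro>(end - start).count();
  std::cout << "avg at::empty overhead: " << us / iters << " us\n";
}
```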
@pytorchmergebot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased 2622a13 to 0d76f10.
@pytorchmergebot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased 0d76f10 to 104eb92.
@pytorchmergebot merge
Merge failed. Reason: Approvers from one of the following sets are needed:
albanD left a comment:
SGTM
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…tensors rather than going to the caching allocator directly (#139442)

CC @zdevito @janeyx99

This isn't ideal, but cuBLASLt workspaces are not currently cached, so this additional untracked allocation will cause `test_cuda_tracker_equivalence` to fail with a large enough workspace size, e.g., `CUBLAS_LT_WORKSPACE_SIZE=32768`. One solution is to just use byte-tensors for the workspace instead of going directly to the caching allocator.

Pull Request resolved: #139442
Approved by: https://github.com/Aidyn-A, https://github.com/albanD, https://github.com/janeyx99
…es (#145130)

As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching: `cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt` as well
+ a "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker: #139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but didn't seem to fully work, whereas the cached/reused `cuBLAS` workspace here seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt`, without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider

Pull Request resolved: #145130
Approved by: https://github.com/ngimel
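To illustrate the "one environment variable" point, here is a hedged sketch of deriving a workspace size from `CUBLAS_WORKSPACE_CONFIG`. The colon-separated `<size_kib>:<count>` pair format (e.g. `:4096:8`) follows NVIDIA's documented convention for this variable; the helper name, fallback handling, and exact parsing rules below are illustrative assumptions, not PyTorch's implementation.

```cpp
// Sketch: compute a total workspace byte count from CUBLAS_WORKSPACE_CONFIG,
// e.g. ":4096:8" -> 8 chunks of 4096 KiB. Illustrative only.
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>

size_t workspace_bytes_from_env(size_t fallback_bytes) {
  const char* cfg = std::getenv("CUBLAS_WORKSPACE_CONFIG");
  if (cfg == nullptr || *cfg == '\0') {
    return fallback_bytes;  // variable unset: fall back to a default size
  }
  // Treat ':' and ',' as separators so "<size_kib> <count>" pairs can be read.
  std::string s(cfg);
  for (char& c : s) {
    if (c == ':' || c == ',') c = ' ';
  }
  std::istringstream in(s);
  size_t total = 0, size_kib = 0, count = 0;
  while (in >> size_kib >> count) {
    total += size_kib * 1024 * count;
  }
  return total > 0 ? total : fallback_bytes;
}
```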
CC @zdevito @janeyx99
This isn't ideal, but cuBLASLt workspaces are not currently cached, so this additional untracked allocation will cause `test_cuda_tracker_equivalence` to fail with a large enough workspace size, e.g., `CUBLAS_LT_WORKSPACE_SIZE=32768`. One solution is to just use byte-tensors for the workspace instead of going directly to the caching allocator.

cc @ptrblck @msaroufim @csarofeen @xwang233