Conversation

@eqy
Collaborator

@eqy eqy commented Nov 1, 2024

CC @zdevito @janeyx99

This isn't ideal, but cuBLASLt workspaces are not currently cached, so this additional untracked allocation causes `test_cuda_tracker_equivalence` to fail with a large enough workspace size, e.g. `CUBLASLT_WORKSPACE_SIZE=32768`. One solution is to simply use byte tensors for the workspace instead of going directly to the caching allocator.
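
For illustration, a minimal sketch of the two allocation paths (not the exact diff in this PR; `workspace_size` is a hypothetical placeholder for the configured workspace size in bytes):

#include <ATen/ATen.h>
#include <c10/cuda/CUDACachingAllocator.h>

// Previous approach (sketch): raw memory straight from the caching allocator.
// This block is invisible to tensor-level memory trackers.
c10::DataPtr workspace_via_allocator(size_t workspace_size) {
  return c10::cuda::CUDACachingAllocator::get()->allocate(workspace_size);
}

// Approach in this PR, roughly: allocate a byte tensor instead, so the
// workspace shows up as an ordinary tracked allocation.
at::Tensor workspace_via_byte_tensor(int64_t workspace_size) {
  return at::empty({workspace_size},
                   at::TensorOptions().dtype(at::kByte).device(at::kCUDA));
}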

cc @ptrblck @msaroufim @csarofeen @xwang233

@eqy eqy added the module: cuda, module: cublas, and topic: not user facing labels Nov 1, 2024
@eqy eqy requested a review from syed-ahmed as a code owner November 1, 2024 00:01
@pytorch-bot

pytorch-bot bot commented Nov 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139442

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 104eb92 with merge base d0fd42e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eqy
Collaborator Author

eqy commented Nov 1, 2024

CC @nWEIdia @tinglvv as we discussed this

@Aidyn-A, as the 32 MiB default workspace size for H100 is relevant here

Collaborator

@Aidyn-A Aidyn-A left a comment

Looks good to me. QQ: does at::empty have a nullptr check? If yes, this:

TORCH_CHECK(workspace.data_ptr() != nullptr, "OOM trying to allocate workspace for cublaslt");

would be redundant.
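
For reference, a small hypothetical sketch of the point above, assuming the workspace is created via at::empty as in this PR: to my understanding, the CUDA caching allocator behind at::empty throws c10::OutOfMemoryError on failure rather than handing back a null data pointer, so the explicit check should be unreachable in practice.

#include <ATen/ATen.h>

// Hypothetical sketch (not the PR diff): if the allocation inside at::empty
// fails, an exception is raised before control ever reaches the TORCH_CHECK,
// which is why the nullptr check looks redundant.
at::Tensor allocate_lt_workspace(int64_t workspace_size) {
  auto workspace = at::empty({workspace_size},
                             at::TensorOptions().dtype(at::kByte).device(at::kCUDA));
  TORCH_CHECK(workspace.data_ptr() != nullptr,
              "OOM trying to allocate workspace for cublaslt");
  return workspace;
}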

@janeyx99
Contributor

janeyx99 commented Nov 1, 2024

Does this have performance implications? That is, will the at::empty overhead make calling cuBLAS APIs slower than it was before?

@eqy
Collaborator Author

eqy commented Nov 15, 2024

Does this have performance implications? That is, will the at::empty overhead make calling cuBLAS APIs slower than it was before?

Finally did some microbenchmarking; the overhead seems to be in the range of 0.5 µs per allocation (roughly 0.79 µs for at::empty vs. 0.24 µs for a direct caching-allocator call):

#include <torch/torch.h>
#include <torch/cuda.h>
#include <c10/cuda/CUDACachingAllocator.h>

#include <iostream>
#include <chrono>

#define ITER 1000000

int main() {
  auto caching_allocator = c10::cuda::CUDACachingAllocator::get();
  // Warm up the at::empty path (1024 float32 elements = 4 KiB).
  std::cout << "start empty" << std::endl;
  auto t = at::empty({1024}, at::device(at::kCUDA));
  torch::cuda::synchronize();
  std::cout << "empty warmup finished" << std::endl;
  // Time ITER allocations through at::empty; reassigning `t` also frees the
  // previous tensor each iteration.
  auto t0 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < ITER; i++) {
    t = at::empty({1024}, at::device(at::kCUDA));
  }
  torch::cuda::synchronize();
  auto t1 = std::chrono::high_resolution_clock::now();
  // Warm up the direct caching-allocator path with a matching 4 KiB request.
  std::cout << "start allocate" << std::endl;
  auto ptr = caching_allocator->allocate(1024*4);
  torch::cuda::synchronize();
  std::cout << "allocate warmup finished" << std::endl;
  // Time ITER raw allocations; reassigning the DataPtr frees the previous block.
  auto t2 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < ITER; i++) {
    ptr = caching_allocator->allocate(1024*4);
  }
  torch::cuda::synchronize();
  auto t3 = std::chrono::high_resolution_clock::now();
  auto empty_time = std::chrono::duration_cast<std::chrono::duration<double>>(t1 - t0).count();
  auto allocate_time = std::chrono::duration_cast<std::chrono::duration<double>>(t3 - t2).count();
  std::cout << "empty time per iter: " << empty_time/ITER << std::endl;
  std::cout << "allocate time per iter: " << allocate_time/ITER << std::endl;
}

Output:

start empty
empty warmup finished
start allocate
allocate warmup finished
empty time per iter: 7.88432e-07
allocate time per iter: 2.3696e-07

@eqy eqy added the ciflow/trunk label Nov 18, 2024
@eqy
Collaborator Author

eqy commented Nov 18, 2024

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased memtrackerlt onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout memtrackerlt && git pull --rebase)

@eqy
Collaborator Author

eqy commented Nov 25, 2024

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased memtrackerlt onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout memtrackerlt && git pull --rebase)

@eqy
Collaborator Author

eqy commented Nov 26, 2024

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

Collaborator

@albanD albanD left a comment

SGTM

@janeyx99
Contributor

janeyx99 commented Dec 6, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

pytorch-bot bot pushed a commit that referenced this pull request Dec 9, 2024
…tensors rather than going to the caching allocator directly (#139442)

CC @zdevito @janeyx99

This isn't ideal, but cuBLASLt workspaces are not currently cached, so this additional untracked allocation causes `test_cuda_tracker_equivalence` to fail with a large enough workspace size, e.g. `CUBLASLT_WORKSPACE_SIZE=32768`. One solution is to simply use byte tensors for the workspace instead of going directly to the caching allocator.

Pull Request resolved: #139442
Approved by: https://github.com/Aidyn-A, https://github.com/albanD, https://github.com/janeyx99
pytorchmergebot pushed a commit that referenced this pull request Feb 6, 2025
…es (#145130)

As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925
+ fixes behavior broken behavior with the memtracker; #139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider

Pull Request resolved: #145130
Approved by: https://github.com/ngimel
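
As a purely illustrative sketch of the consolidated configuration, assuming the documented `:SIZE_IN_KiB:COUNT` format of `CUBLAS_WORKSPACE_CONFIG` (the matmul shapes are arbitrary):

#include <cstdlib>
#include <torch/torch.h>

int main() {
  // Illustrative only: ":4096:8" requests eight 4096 KiB workspace chunks.
  // The variable must be set before the first cuBLAS/cuBLASLt call in the
  // process; after #145130 the same setting is expected to cover the
  // cuBLASLt workspace as well.
  setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:8", /*overwrite=*/1);

  auto a = torch::randn({1024, 1024}, torch::kCUDA);
  auto b = torch::randn({1024, 1024}, torch::kCUDA);
  auto c = torch::mm(a, b);  // the first GEMM picks up the workspace config
  torch::cuda::synchronize();
  return 0;
}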
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Feb 7, 2025
X-link: pytorch/pytorch#145130
Approved by: https://github.com/ngimel

Reviewed By: atalman

Differential Revision: D69257102

fbshipit-source-id: 4a2e6391fa899829758596ab2e2f4b16003e5197
pytorchmergebot pushed a commit that referenced this pull request Feb 23, 2025
…es (#145130)

Pull Request resolved: #145130
Approved by: https://github.com/ngimel
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Feb 24, 2025
X-link: pytorch/pytorch#145130
Approved by: https://github.com/ngimel

Reviewed By: jeanschmidt

Differential Revision: D70075331

fbshipit-source-id: cf4d0d687b299c942793a758c6fec4b064c44227
aditew01 pushed a commit that referenced this pull request Feb 28, 2025
…es (#145130)

Pull Request resolved: #145130
Approved by: https://github.com/ngimel
pytorchmergebot pushed a commit that referenced this pull request Mar 22, 2025
…es (#145130)

Pull Request resolved: #145130
Approved by: https://github.com/ngimel
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Mar 24, 2025
X-link: pytorch/pytorch#145130
Approved by: https://github.com/ngimel

Reviewed By: izaitsevfb

Differential Revision: D71711852

fbshipit-source-id: 4f57539b8f37f1f4c92a57c19276e84f81bffa23
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
…es (pytorch#145130)

Pull Request resolved: pytorch#145130
Approved by: https://github.com/ngimel