Conversation

@zhaojuanmao
Contributor

addmm_cuda_lt failed for some corner cases. So far we cannot reproduce these corner cases in the unit tests; it seems the failures do not depend only on the matrices' shapes and strides. For now, add an environment variable that allows users to disable this kernel for such corner cases.

See the case one with more error logs:

RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 80 n 1024 k 160 mat1_ld 160 mat2_ld 160 result_ld 80 abcType 14 computeType 68 scaleType 0 result_shape 1024 80 result_stride 80 1 self_shape 80 self_stride 1 mat1_shape 1024 160 mat1_stride 160 1 mat2_shape 160 80 mat2_stride 1 160
Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:1071 (most recent call first):

another case with more error logs:

RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 16 n 16384 k 48 mat1_ld 48 mat2_ld 48 result_ld 16 abcType 14 computeType 68 scaleType 0 result_shape 16384 16 result_stride 16 1 self_shape 16 self_stride 1 mat1_shape 16384 48 mat1_stride 48 1 mat2_shape 48 16 mat2_stride 1 48
Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:1071 (most recent call first):
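
For illustration, a minimal sketch of the kind of environment-variable gate this PR adds; the variable name DISABLE_ADDMM_CUDA_LT and the helper below are illustrative, not the exact code in the diff:

#include <cstdlib>
#include <cstring>

// Returns true when the user has asked to skip the cublasLt addmm path.
static bool disable_addmm_cuda_lt_from_env() {
  const char* env = std::getenv("DISABLE_ADDMM_CUDA_LT");
  return env != nullptr && std::strcmp(env, "1") == 0;
}

// In the addmm CUDA dispatch the flag would be consulted alongside the existing
// shape, stride, and dtype checks; when it is set, addmm falls back to the
// regular gemm-plus-bias path instead of calling cublasLtMatmul.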

@pytorch-bot

pytorch-bot bot commented Dec 28, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91436

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d9e2714:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@zhaojuanmao force-pushed the disableaddmmcudalt branch 2 times, most recently from 8b46bdc to d54aa3e on December 28, 2022 19:07
@zhaojuanmao requested a review from ptrblck on December 28, 2022 19:08
@ptrblck
Collaborator

ptrblck commented Dec 28, 2022

addmm_cuda_lt failed for some corner cases. So far we cannot reproduce these corner cases in the unit tests; it seems the failures do not depend only on the matrices' shapes and strides.

This might also mean that cublas is running into a sticky error / corrupt CUDA context and is just the victim.
What are these workloads and did you try to launch them via CUDA_LAUNCH_BLOCKING=1 or in a compute-sanitizer run?

@zhaojuanmao
Contributor Author

addmm_cuda_lt failed for some corner cases. So far we cannot reproduce these corner cases in the unit tests; it seems the failures do not depend only on the matrices' shapes and strides.

This might also mean that cublas is running into a sticky error / corrupt CUDA context and is just the victim. What are these workloads and did you try to launch them via CUDA_LAUNCH_BLOCKING=1 or in a compute-sanitizer run?

Thank you @ptrblck. I hard-coded disabling the 'addmm_cuda_lt' kernel and the training went through, so I think it should be related to the 'addmm_cuda_lt' kernel?

I can run with CUDA_LAUNCH_BLOCKING=1 later on. What does a compute-sanitizer run mean?

@zhaojuanmao
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased disableaddmmcudalt onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout disableaddmmcudalt && git pull --rebase)

@facebook-github-bot
Contributor

@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ptrblck
Collaborator

ptrblck commented Dec 29, 2022

I can run with CUDA_LAUNCH_BLOCKING=1 later on. What does a compute-sanitizer run mean?

compute-sanitizer would allow you to check for memory violations, race conditions etc.
Could you post the workload which creates the issue, so that I could also try to reproduce and debug it?
cublas could certainly fail, but I would like to get more information about the failure case while this workaround is used.

@zhaojuanmao
Contributor Author

The dynamo unit test failures are not related.

@zhaojuanmao
Contributor Author

I can run with CUDA_LAUNCH_BLOCKING=1 later on. What does a compute-sanitizer run mean?

compute-sanitizer would allow you to check for memory violations, race conditions etc. Could you post the workload which creates the issue, so that I could also try to reproduce and debug it? cublas could certainly fail, but I would like to get more information about the failure case while this workaround is used.

It is an internal workload that is hard to rewrite in OSS, though.

What I can do is get extra error logs. So far we can reproduce 3 failure cases for this workload; I added an error log inside CUDABlas.cpp that is printed when the error is hit:

Failure case 1

RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 80 n 1024 k 160 mat1_ld 160 mat2_ld 160 result_ld 80 abcType 14 computeType 68 scaleType 0 result_shape 1024 80 result_stride 80 1 self_shape 80 self_stride 1 mat1_shape 1024 160 mat1_stride 160 1 mat2_shape 160 80 mat2_stride 1 160
Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:1071 (most recent call first):

Failure case 2:

RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 16 n 16384 k 48 mat1_ld 48 mat2_ld 48 result_ld 16 abcType 14 computeType 68 scaleType 0 result_shape 16384 16 result_stride 16 1 self_shape 16 self_stride 1 mat1_shape 16384 48 mat1_stride 48 1 mat2_shape 48 16 mat2_stride 1 48
Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:1071 (most recent call first):

Failure case 3:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 16 n 327680 k 16 mat1_ld 16 mat2_ld 16 result_ld 16 abcType 14 computeType 68 scaleType 0 result_shape 327680 16 result_stride 16 1 result continuguous 1 self_shape 16 self_stride 1 self continuguous 1 mat1_shape 327680 16 mat1_stride 16 1 mat1 continuguous 1 mat2_shape 16 16 mat2_stride 1 16 mat2 continuguous 0

Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:849

@ptrblck do you see anything in common across the above 3 failure cases, or any other extra debugging info I can add to help root-cause it further?

Thanks for your help!

@zhaojuanmao
Contributor Author

@ptrblck by the way, the extra error log I added looks like this, placed right after the 'cublasStatus_t cublasStatus = cublasLtMatmul(...)' call:

// Fail with the full cublasLtMatmul problem configuration when the call did not succeed.
TORCH_CHECK(
cublasStatus == CUBLAS_STATUS_SUCCESS,
"CUDA error: ",
at::cuda::blas::_cublasGetErrorEnum(cublasStatus),
" when calling cublasLtMatmul with transpose_mat1 ",
transpose_mat1,
" transpose_mat2 ",
transpose_mat2,
" m ",
m,
" n ",
n,
" k ",
k,
" mat1_ld ",
mat1_ld,
" mat2_ld ",
mat2_ld,
" result_ld ",
result_ld,
" abcType ",
abcType,
" computeType ",
computeType,
" scaleType ",
scaleType,
" result_shape ",
result_shape.str(),
" result_stride ",
result_stride.str(),
" result continuguous ",
result.is_contiguous(),
" self_shape ",
self_shape.str(),
" self_stride ",
self_stride.str(),
" self continuguous ",
self.is_contiguous(),
" mat1_shape ",
mat1_shape.str(),
" mat1_stride ",
mat1_stride.str(),
" mat1 continuguous ",
mat1.is_contiguous(),
" mat2_shape ",
mat2_shape.str(),
" mat2_stride ",
mat2_stride.str(),
" mat2 continuguous ",
mat2.is_contiguous());

@facebook-github-bot
Contributor

@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@zhaojuanmao
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased disableaddmmcudalt onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout disableaddmmcudalt && git pull --rebase)

@facebook-github-bot
Contributor

@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Collaborator

you should also avoid strcmp on the hot path, do it once.
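
A minimal sketch of the suggested "do it once" pattern, caching the result of the environment lookup in a function-local static; the names here are illustrative, not the PR's actual code:

#include <cstdlib>
#include <cstring>

// The lambda runs once, when the function-local static is initialized;
// later calls only read the cached bool, keeping getenv/strcmp off the
// per-addmm hot path.
static bool disable_addmm_cuda_lt_cached() {
  static const bool disabled = []() {
    const char* env = std::getenv("DISABLE_ADDMM_CUDA_LT");
    return env != nullptr && std::strcmp(env, "1") == 0;
  }();
  return disabled;
}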

@ptrblck
Collaborator

ptrblck commented Dec 29, 2022

@zhaojuanmao Thanks for the information about your debugging steps!
I understand it might not be trivial to write a minimal code snippet which reproduces the issue, but any information you could share might help me try to reproduce it (also feel free to ping me on Slack in case you would prefer it).
So far, I would add a sync and a CUDA error check at gemm_and_bias before any cublas call is invoked to check if a sticky error is already reported.
I don't want to hijack this PR for the debugging discussion so let's follow up on Slack or in a new issue (in case you can share any more information about the use case and error).
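
For reference, a minimal sketch of such a pre-call check, written as a hypothetical helper using the existing AT_CUDA_CHECK macro; it is not part of this PR:

#include <cuda_runtime.h>
#include <ATen/cuda/Exceptions.h>  // AT_CUDA_CHECK

// Hypothetical debugging aid: call at the top of gemm_and_bias to surface a
// sticky error left by an earlier kernel before the cublasLt call gets blamed
// for it.
static void assert_no_pending_cuda_error() {
  // Force completion of earlier asynchronous work so its failures show up here.
  AT_CUDA_CHECK(cudaDeviceSynchronize());
  // Throw immediately if the CUDA context already carries an error.
  AT_CUDA_CHECK(cudaGetLastError());
}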

@facebook-github-bot
Contributor

@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@zhaojuanmao
Contributor Author

@zhaojuanmao Thanks for the information about your debugging steps! I understand it might not be trivial to write a minimal code snippet which reproduces the issue, but any information you could share might help me try to reproduce it (also feel free to ping me on Slack in case you would prefer it). So far, I would add a sync and a CUDA error check at gemm_and_bias before any cublas call is invoked to check if a sticky error is already reported. I don't want to hijack this PR for the debugging discussion so let's follow up on Slack or in a new issue (in case you can share any more information about the use case and error).

sounds great, thanks!

@zhaojuanmao
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased disableaddmmcudalt onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout disableaddmmcudalt && git pull --rebase)

@facebook-github-bot
Contributor

@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@zhaojuanmao
Contributor Author

@pytorchbot merge

@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Jan 3, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@github-actions deleted the disableaddmmcudalt branch July 5, 2024 01:53
Labels

ciflow/trunk · Merged · release notes: distributed (fsdp)
