Add an env variable to disable addmm_cuda_lt kernel #91436
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91436
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit d9e2714.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
8b46bdc to d54aa3e (Compare)
This might also mean that cublas is running into a sticky error / corrupt CUDA context and is just the victim.
Thank you @ptrblck. I hard-coded disabling the addmm_cuda_lt kernel and the training went through, so I think it is related to the addmm_cuda_lt kernel. I can run with …
d54aa3e to 8d2cd1f (Compare)
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased
8d2cd1f to 7e28a8a (Compare)
@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
The dynamo unit test failures are not related to this PR.
It is an internal workload that is hard to rewrite in OSS, though. What I can do is get extra error logs. So far we can reproduce 3 failure cases for this workload; I added an error log inside CUDABlas.cpp that prints when the error is hit.

Failure case 1:
RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 80 n 1024 k 160 mat1_ld 160 mat2_ld 160 result_ld 80 abcType 14 computeType 68 scaleType 0 result_shape 1024 80 result_stride 80 1 self_shape 80 self_stride 1 mat1_shape 1024 160 mat1_stride 160 1 mat2_shape 160 80 mat2_stride 1 160

Failure case 2:
RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 16 n 16384 k 48 mat1_ld 48 mat2_ld 48 result_ld 16 abcType 14 computeType 68 scaleType 0 result_shape 16384 16 result_stride 16 1 self_shape 16 self_stride 1 mat1_shape 16384 48 mat1_stride 48 1 mat2_shape 48 16 mat2_stride 1 48

Failure case 3:
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 16 n 327680 k 16 mat1_ld 16 mat2_ld 16 result_ld 16 abcType 14 computeType 68 scaleType 0 result_shape 327680 16 result_stride 16 1 result continuguous 1 self_shape 16 self_stride 1 self continuguous 1 mat1_shape 327680 16 mat1_stride 16 1 mat1 continuguous 1 mat2_shape 16 16 mat2_stride 1 16 mat2 continuguous 0
Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:849

@ptrblck, did you see anything common across the above 3 failure cases, or is there any other extra debugging info I can add to help root-cause it further? Thanks for your help!
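(For reference, a hypothetical repro attempt for failure case 1 could look like the sketch below, using the logged shapes and strides and assuming abcType 14 corresponds to CUDA_R_16BF, i.e. bfloat16. As noted in this thread, matching shapes and strides alone has not been enough to reproduce the failure, so this is only a starting point, not a known reproducer.)

```cpp
#include <ATen/ATen.h>

// Hypothetical repro sketch for failure case 1; shapes/strides come from the
// logged fields, the bfloat16 dtype is an assumption based on abcType 14.
int main() {
  auto opts = at::TensorOptions().device(at::kCUDA).dtype(at::kBFloat16);
  auto mat1 = at::randn({1024, 160}, opts);    // mat1_shape 1024 160, stride 160 1
  auto mat2 = at::randn({80, 160}, opts).t();  // mat2_shape 160 80, stride 1 160
  auto bias = at::randn({80}, opts);           // self_shape 80
  auto out = at::addmm(bias, mat1, mat2);      // may route to addmm_cuda_lt
  return 0;
}
```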
@ptrblck, by the way, the extra error log I added looks like this, placed after the 'cublasStatus_t cublasStatus = cublasLtMatmul(...)' call: TORCH_CHECK(…
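(The quoted TORCH_CHECK is truncated above. Based on the fields that show up in the error messages, the added check presumably looks roughly like the sketch below; the parameter names are placeholders, not the verbatim change.)

```cpp
#include <cublas_v2.h>
#include <c10/util/Exception.h>  // TORCH_CHECK

// Sketch only: after cublasLtMatmul returns, verify the status and dump the
// call parameters into the error message so the failing shapes/types appear
// in the RuntimeError, as in the logs quoted earlier.
void check_cublaslt_status(
    cublasStatus_t status,
    bool transpose_mat1, bool transpose_mat2,
    int64_t m, int64_t n, int64_t k,
    int64_t mat1_ld, int64_t mat2_ld, int64_t result_ld,
    int abcType, int computeType, int scaleType) {
  TORCH_CHECK(
      status == CUBLAS_STATUS_SUCCESS,
      "CUDA error: cublas status ", static_cast<int>(status),
      " when calling cublasLtMatmul with transpose_mat1 ", transpose_mat1,
      " transpose_mat2 ", transpose_mat2,
      " m ", m, " n ", n, " k ", k,
      " mat1_ld ", mat1_ld, " mat2_ld ", mat2_ld, " result_ld ", result_ld,
      " abcType ", abcType, " computeType ", computeType,
      " scaleType ", scaleType);
}
```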
@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased
7e28a8a to 4317980 (Compare)
@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
aten/src/ATen/native/cuda/Blas.cpp (Outdated)
You should also avoid strcmp on the hot path; do it once.
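(A minimal sketch of the pattern the reviewer is asking for: read the environment variable once and cache the result, instead of calling getenv/strcmp on every addmm call. The name DISABLE_ADDMM_CUDA_LT is assumed here for illustration; the exact variable name is not shown in this thread.)

```cpp
#include <cstdlib>
#include <cstring>

// Sketch only: cache the env lookup in a function-local static so the
// getenv/strcmp work happens once, not on the addmm hot path.
static bool isAddmmCudaLtDisabled() {
  static const bool disabled = []() {
    const char* env = std::getenv("DISABLE_ADDMM_CUDA_LT");  // assumed name
    return env != nullptr && std::strcmp(env, "1") == 0;
  }();
  return disabled;
}
```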
@zhaojuanmao Thanks for the information about your debugging steps!
4317980 to 5127bd2 (Compare)
@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
sounds great, thanks!
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased
5127bd2 to d9e2714 (Compare)
@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
addmm_cuda_lt fails for some corner cases. So far we cannot reproduce these corner cases in unit tests; the failures seem to depend on more than just the matrices' shapes and strides. For now, add an environment variable that lets users disable this kernel for such corner cases.
See case one with more error logs:
RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 80 n 1024 k 160 mat1_ld 160 mat2_ld 160 result_ld 80 abcType 14 computeType 68 scaleType 0 result_shape 1024 80 result_stride 80 1 self_shape 80 self_stride 1 mat1_shape 1024 160 mat1_stride 160 1 mat2_shape 160 80 mat2_stride 1 160
Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:1071 (most recent call first):
Another case with more error logs:
RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 16 n 16384 k 48 mat1_ld 48 mat2_ld 48 result_ld 16 abcType 14 computeType 68 scaleType 0 result_shape 16384 16 result_stride 16 1 self_shape 16 self_stride 1 mat1_shape 16384 48 mat1_stride 48 1 mat2_shape 48 16 mat2_stride 1 48
Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:1071 (most recent call first):
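As a rough illustration of how such a switch could gate the fused path (a sketch under assumptions, not the actual diff; the variable name DISABLE_ADDMM_CUDA_LT and the lt_eligible flag are placeholders):

```cpp
#include <cstdlib>
#include <cstring>

// Sketch only: combine a one-time env-var check with the existing
// shape/stride/dtype eligibility test (stood in for by `lt_eligible`).
// When the variable is set, addmm falls back to the regular cublas gemm
// plus a separate bias add instead of the fused cublasLtMatmul path.
static bool addmm_should_use_lt(bool lt_eligible) {
  static const bool lt_disabled = []() {
    const char* env = std::getenv("DISABLE_ADDMM_CUDA_LT");  // assumed name
    return env != nullptr && std::strcmp(env, "1") == 0;
  }();
  return lt_eligible && !lt_disabled;
}
```

When this returns false, the corner cases above would take the non-Lt path and should therefore avoid the CUBLAS_STATUS_NOT_SUPPORTED error raised from cublasLtMatmul.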