Skip to content

Conversation

@zasdfgbnm
Copy link
Collaborator

@zasdfgbnm zasdfgbnm commented Jul 21, 2020

fixes #41340

Unfortunately, I still can not get a K80 to verify the fix, but it should be working.

@dr-ci
Copy link

dr-ci bot commented Jul 21, 2020

💊 CI failures summary and remediations

As of commit 9dd5bad (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 14 times.

@zasdfgbnm zasdfgbnm changed the title [WIP] Improve zero sized input for addmv Improve zero sized input for addmv Jul 21, 2020
@zasdfgbnm zasdfgbnm requested a review from ngimel July 21, 2020 23:28
@ngimel
Copy link
Collaborator

ngimel commented Jul 22, 2020

I don't think this is the right fix, because the root cause of that failing test case was fp16 gemm for fp16, not fp16 gemmv

device="cuda"
dtype=torch.float16
mat = torch.ones((2, 0), dtype=dtype, device=device)
vec = torch.ones((0,), dtype=dtype, device=device)
print(torch.mm(mat,vec.view(0,1))) #errors out on K80

This fix will mask gemm failure because gemv for degenerate case won't go through gemm, but gemm failure will still remain.

@zasdfgbnm
Copy link
Collaborator Author

zasdfgbnm commented Jul 22, 2020

(Just realize that I put the wrong issue number. It is fixed now)

@ngimel I think both addmm and addmv are wrong? And in the issue (#41340), the author says addmv:

Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 777, in wrapper
    method(*args, **kwargs)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 241, in instantiated_test
    result = test(self, device_arg, dtype)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 411, in dep_fn
    return fn(slf, device, *args, **kwargs)
  File "test_torch.py", line 13909, in test_blas_alpha_beta_empty
    torch.addmv(input=input, mat=mat, vec=vec, alpha=alpha, beta=beta))
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1080, in assertEqual
    exact_device=exact_device)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 971, in _compareTensors
    return _compare_tensors_internal(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan)
  File "/tmp/easybuild-tmp/eb-1Ebm0K/tmpcR9xV8/lib/python3.7/site-packages/torch/testing/__init__.py", line 122, in _compare_tensors_internal
    if torch.allclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan):
RuntimeError: CUDA error: an illegal memory access was encountered

@ngimel
Copy link
Collaborator

ngimel commented Jul 22, 2020

Yeah, the failing test is addmv. Addmv/mv call at::cuda::blas::gemv, however, if you remember, for fp16 at::cuda::blas::gemv does not call cublas gemv (because there's no such thing), instead it calls at:cuda::blas::gemm, which, in turn, calls cublasSgemmEx, which finally throws an error. fp16 mm/addmm also call cublasSgemmEx, so even though the empty calls with them are not tested in the test suite, they would still be affected, as demonstrated in the snippet above. So both mv and mm are affected with the same rootcause.

@mrshenli mrshenli added module: operators triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Jul 23, 2020
@zasdfgbnm
Copy link
Collaborator Author

zasdfgbnm commented Jul 28, 2020

@ngimel now addmm is fixed too

@zasdfgbnm
Copy link
Collaborator Author

ping @ngimel

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Copy link
Collaborator

ngimel commented Aug 17, 2020

Thank you for the fix!

@facebook-github-bot
Copy link
Contributor

@ngimel merged this pull request in aef2890.

zasdfgbnm added a commit that referenced this pull request Aug 29, 2020
zasdfgbnm added a commit that referenced this pull request Aug 29, 2020
ngimel pushed a commit to ngimel/pytorch that referenced this pull request Aug 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error in test suite: an illegal memory access was encountered

7 participants