
torch.mm gives wrong results on certain combinations of input size, device type, and cuBLAS version. #22078

@umanwizard

Description

🐛 Bug

To Reproduce

Ensure that you are running CUDA 9.0 on a Maxwell or Pascal device, and execute the following script:

import sys
import torch

print('Torch version:', torch.__version__)
print('sys.version', sys.version)

# Each row count below exceeds 2**21 (see "Environment" for why that matters).
for n_rows in [
        0b01000000000000000000001,
        0b01000000000000000000010,
        0b10000010010000010001110,
        0b11000000000000000000000
        ]:
    a = torch.ones(n_rows, 2).float().cuda()
    b = torch.ones(2, 2).float().cuda()
    # ones(n_rows, 2) @ ones(2, 2) is all 2s, so this should print 0.0.
    print((torch.mm(a, b) - 2).abs().max().item())

If the script prints anything other than "0.0" on any line, you have reproduced the bug.

To reproduce the issue in pure cuBLAS code (with no reference to PyTorch), see this gist: https://gist.github.com/umanwizard/2b2e2fc12485ef6dc1cdfb1421276dd9
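The gist linked above is the authoritative pure-cuBLAS reproduction; as a rough sketch of the same idea (not the gist's exact code — variable names are ours, error handling is elided, and it assumes a single affected GPU), it amounts to one plain `cublasSgemm` call with the row count just above 2^21 and a check that every output entry is 2.0f:

```cuda
// Sketch only (see the gist for the exact code): multiply an (m x 2)
// matrix of ones by a (2 x 2) matrix of ones and count wrong entries.
#include <cstdio>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = (1 << 21) + 1;  // rows just above 2^21 (condition 2 below)
    const int k = 2, n = 2;

    std::vector<float> h_a(size_t(m) * k, 1.0f);
    std::vector<float> h_b(size_t(k) * n, 1.0f);
    std::vector<float> h_c(size_t(m) * n);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, h_a.size() * sizeof(float));
    cudaMalloc(&d_b, h_b.size() * sizeof(float));
    cudaMalloc(&d_c, h_c.size() * sizeof(float));
    cudaMemcpy(d_a, h_a.data(), h_a.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), h_b.size() * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major C = A * B, where C is m x n; every entry should be 2.0f.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, d_a, m, d_b, k, &beta, d_c, m);
    cudaMemcpy(h_c.data(), d_c, h_c.size() * sizeof(float), cudaMemcpyDeviceToHost);

    long bad = 0;
    for (float v : h_c) if (v != 2.0f) ++bad;
    printf("wrong entries: %ld\n", bad);  // non-zero on affected setups

    cublasDestroy(handle);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```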

Environment

The following are known to be necessary conditions for the issue:

  1. cuBLAS version is less than 9.2
  2. At least one dimension of the matrix is larger than 2^21
  3. Device architecture is Maxwell or Pascal
  4. Data type is float or half

Other than that, we don't know the exact conditions under which it triggers.
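As a sanity check on condition 2, every row count in the repro script does exceed 2^21 — two of them by only 1 or 2 rows. A standalone check (plain Python, no GPU needed; the names are ours, not from the issue):

```python
# The four row counts from the repro script, written as in the issue.
REPRO_SIZES = [
    0b01000000000000000000001,  # 2**21 + 1
    0b01000000000000000000010,  # 2**21 + 2
    0b10000010010000010001110,
    0b11000000000000000000000,  # 3 * 2**21
]

THRESHOLD = 1 << 21  # 2,097,152: the cutoff named in condition 2

for n_rows in REPRO_SIZES:
    print(n_rows, n_rows > THRESHOLD)
```

This prints `True` for all four sizes, so condition 2 is consistent with the observed failures.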

cc @ezyang @gchanan @zou3519

    Labels

    module: cublas (Problem related to cublas support); triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
