
Conversation

@IvanYashchuk (Collaborator)

Fixes #42418.

The problem was that non-contiguous batched matrices were passed directly to gemmStridedBatched.

The following code fails on master and works with the proposed patch:

import torch

x = torch.tensor([[1., 2, 3], [4., 5, 6]], device='cuda:0')
# Non-contiguous batched view: strides [3, 1, 1] instead of the contiguous [4, 2, 1]
c = torch.as_strided(x, size=[2, 2, 2], stride=[3, 1, 1])
torch.einsum('...ab,...bc->...ac', c, c)  # batched matmul: fails on master, works with this patch
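For context, here is a minimal sketch (using only the public PyTorch API) of the workaround this patch makes unnecessary: materializing the strided view with .contiguous() so the batched GEMM receives a plainly laid-out input.

import torch

x = torch.tensor([[1., 2, 3], [4., 5, 6]], device='cuda:0')
c = torch.as_strided(x, size=[2, 2, 2], stride=[3, 1, 1])

# Workaround before this fix: copy the overlapping strided view into a
# contiguous buffer, which gemmStridedBatched handles correctly.
c_contig = c.contiguous()
out = torch.einsum('...ab,...bc->...ac', c_contig, c_contig)
print(out)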

@dr-ci (bot) commented Aug 2, 2020

💊 CI failures summary and remediations

As of commit cfff0b4 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_doc_test (1/1), step "Doc test":

Aug 04 16:49:42 caused by: Connection refused (os error 111)
Aug 04 16:49:41 +++++ extract_trap_cmd 
Aug 04 16:49:41 +++++ printf '%s\n' '' 
Aug 04 16:49:41 ++++ printf '%s\n' cleanup 
Aug 04 16:49:41 +++ trap -- ' 
Aug 04 16:49:41 cleanup' EXIT 
Aug 04 16:49:41 +++ [[ pytorch-doc-test != *pytorch-win-* ]] 
Aug 04 16:49:41 +++ which sccache 
Aug 04 16:49:41 +++ sccache --stop-server 
Aug 04 16:49:42 Stopping sccache server... 
Aug 04 16:49:42 error: couldn't connect to server 
Aug 04 16:49:42 caused by: Connection refused (os error 111) 
Aug 04 16:49:42 +++ true 
Aug 04 16:49:42 +++ rm /var/lib/jenkins/sccache_error.log 
Aug 04 16:49:42 +++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Aug 04 16:49:42 +++ SCCACHE_IDLE_TIMEOUT=1200 
Aug 04 16:49:42 +++ RUST_LOG=sccache::server=error 
Aug 04 16:49:42 +++ sccache --start-server 
Aug 04 16:49:42 Starting sccache server... 
Aug 04 16:49:42 +++ sccache --zero-stats 
Aug 04 16:49:42 Compile requests                 0 
Aug 04 16:49:42 Compile requests executed        0 

1 failure not recognized by patterns:

Job: CircleCI pytorch_python_doc_build, step "Doc Build and Push"


@ngimel (Collaborator) commented Aug 3, 2020

Thank you, the fix looks good! Please add the test to the test suite. There's test_baddbmm in test_torch.py, but for some reason it's enabled only on the CPU. Can you try enabling it on CUDA and adding a test case for the behavior you enabled there?

@IvanYashchuk (Collaborator, Author)

Sure, I will do that.

@IvanYashchuk (Collaborator, Author)

While writing tests, I found that torch.mm also doesn't work with this kind of strided input:

import torch

x = torch.tensor([[1., 2, 3], [4., 5, 6]], device='cuda:0')
c = torch.as_strided(x, size=[2, 2, 2], stride=[3, 1, 1])  # non-contiguous view
torch.mm(c[0], c[0])  # fails on master

ATen/native/cuda/LinearAlgebra.cu:prepare_matrix_for_cublas had the same problem as THCTensor_(baddbmm): the input was not transformed into a contiguous array. I have fixed that as well.
I've added tests for both torch.mm and torch.bmm. Correctness is checked by comparing against the NumPy result.
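For illustration, a standalone sketch of that kind of NumPy comparison (not the actual code added to test_torch.py):

import numpy as np
import torch

x = torch.tensor([[1., 2, 3], [4., 5, 6]], device='cuda:0')
c = torch.as_strided(x, size=[2, 2, 2], stride=[3, 1, 1])

# Single non-contiguous matrix: torch.mm vs. NumPy
expected_mm = np.matmul(c[0].cpu().numpy(), c[0].cpu().numpy())
np.testing.assert_allclose(torch.mm(c[0], c[0]).cpu().numpy(), expected_mm)

# Whole non-contiguous batch: torch.bmm vs. NumPy
expected_bmm = np.matmul(c.cpu().numpy(), c.cpu().numpy())
np.testing.assert_allclose(torch.bmm(c, c).cpu().numpy(), expected_bmm)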

@IvanYashchuk changed the title from "Fix the bug in THCTensor_(baddbmm) for strided views input" to "Fix the bug in THCTensor_(baddbmm) and ATen's addmm_cuda for strided views input" on Aug 3, 2020.
@ngimel (Collaborator) left a comment

Thank you, this looks great! I have a small suggestion about the test.

@ngimel (Collaborator) left a comment

Awesome, thanks!

@ngimel (Collaborator) commented Aug 4, 2020

The flake8 error is real.

@IvanYashchuk (Collaborator, Author)

> The flake8 error is real.

Is it okay to ignore E731? It is triggered because the test assigns lambdas to names instead of using def.

@ngimel (Collaborator) commented Aug 4, 2020

Yeah, that's fine; you could also pass the lambdas directly as arguments, but it does not matter. Sorry to ask, but can you please rebase? We had an issue with Docker images today, so CI on this PR is failing because it is against a bad base commit and can't find the Docker images.
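For illustration, a small sketch (with made-up names, not the actual test code) of the two flake8-clean alternatives mentioned here: a def instead of an assigned lambda, or a lambda passed directly as an argument.

import torch

# E731 flags assigning a lambda to a name:
#     fn = lambda a, b: torch.mm(a, b)   # flake8 E731
# Alternative 1: use a def.
def mm_fn(a, b):
    return torch.mm(a, b)

# Alternative 2: pass the lambda directly as an argument.
def run_case(op, a, b):
    return op(a, b)

a = torch.eye(2)
print(run_case(mm_fn, a, a))
print(run_case(lambda x, y: torch.mm(x, y), a, a))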

@IvanYashchuk (Collaborator, Author) commented Aug 4, 2020

Alright, I did something wrong here. I rebased onto master and pushed to the branch, and the PR was automatically closed. I'll try to fix that.

@ngimel (Collaborator) commented Aug 4, 2020

It somehow became a 0-commit, 0-line pull request; that's probably why it was closed. Did something go wrong with the rebase?

@IvanYashchuk (Collaborator, Author) commented Aug 4, 2020

I've recovered the branch. I am sorry for the inconvenience. I've re-opened the PR.

@IvanYashchuk reopened this on Aug 4, 2020.
@facebook-github-bot (Contributor) left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel (Collaborator) commented Aug 4, 2020

Thank you!

@facebook-github-bot (Contributor)

@ngimel merged this pull request in b9e68e0.

@IvanYashchuk deleted the fix-issue-42418 branch on August 8, 2020.


Successfully merging this pull request may close these issues.

einsum fails in THCudaBlas_DgemmStridedBatched only when running on GPU (#42418)
