
Conversation

swolchok (Contributor) commented Feb 11, 2022

Stack from ghstack (oldest at bottom):

If the input is 3D and contiguous, we can get a fused addmm by reshaping.

Differential Revision: [D34176407](https://our.internmc.facebook.com/intern/diff/D34176407/)

[ghstack-poisoned]
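
For readers skimming the PR, the optimization can be sketched in a few lines of Python (illustrative only; the actual change lives in ATen's C++ linear/addmm dispatch and differs in detail). A contiguous 3D input can be viewed as 2D at no cost, so the whole layer runs as a single `addmm`, which the cublasLt backend can fuse with the bias add:

```python
import torch

def linear_via_reshape(x, weight, bias):
    # Sketch of the optimization described above (not the actual ATen
    # implementation). For a contiguous 3D input of shape (B, T, C_in),
    # folding the batch dimensions yields a single 2D GEMM, so the layer
    # runs as one bias-fused addmm instead of a matmul plus a separate add.
    assert x.dim() == 3 and x.is_contiguous()
    B, T, C_in = x.shape
    x2d = x.view(B * T, C_in)                   # free for contiguous input
    out2d = torch.addmm(bias, x2d, weight.t())  # bias-fused GEMM
    return out2d.view(B, T, weight.shape[0])

# Sanity check against the reference implementation:
x = torch.randn(4, 7, 16)
w = torch.randn(32, 16)
b = torch.randn(32)
torch.testing.assert_close(linear_via_reshape(x, w, b),
                           torch.nn.functional.linear(x, w, b))
```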
facebook-github-bot (Contributor) commented Feb 11, 2022

💊 CI failures summary and remediations

As of commit 693aad7 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

pytorch-bot (bot) commented Feb 11, 2022

⚛️ CI Flow Status

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/6b2a0788b0c9aef46f553d84f2f759a4a180c8bc/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

| Workflows | Labels (bold enabled) | Status |
| --- | --- | --- |
| **Triggered Workflows** | | |
| linux-binary-conda | ciflow/binaries, ciflow/binaries_conda, **ciflow/default** | ✅ triggered |
| linux-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| linux-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| linux-binary-manywheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |
| linux-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/noarch, ciflow/trunk, ciflow/xla | ✅ triggered |
| linux-bionic-rocm4.5-py3.7 | ciflow/all, **ciflow/default**, ciflow/linux, ciflow/rocm, ciflow/trunk | ✅ triggered |
| linux-docs | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/docs, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-vulkan-bionic-py3.7-clang9 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-cuda11.3-py3.7-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, **ciflow/default**, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, **ciflow/default**, ciflow/linux, ciflow/mobile, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-asan | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/sanitizers, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-clang7-onnx | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/onnx, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| linux-xenial-py3.7-gcc7-no-ops | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| macos-arm64-binary-conda | ciflow/binaries, ciflow/binaries_conda, **ciflow/default** | ✅ triggered |
| macos-arm64-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |
| macos-binary-conda | ciflow/binaries, ciflow/binaries_conda, **ciflow/default** | ✅ triggered |
| macos-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| macos-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| macos-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | ciflow/all, ciflow/android, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | ciflow/all, ciflow/android, ciflow/cpu, **ciflow/default**, ciflow/linux, ciflow/trunk | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, **ciflow/default**, ciflow/trunk, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, **ciflow/default**, ciflow/trunk, ciflow/win | ✅ triggered |
| windows-binary-libtorch-cxx11-abi | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| windows-binary-libtorch-pre-cxx11 | ciflow/binaries, ciflow/binaries_libtorch, **ciflow/default** | ✅ triggered |
| windows-binary-wheel | ciflow/binaries, ciflow/binaries_wheel, **ciflow/default** | ✅ triggered |
| **Skipped Workflows** | | |
| caffe2-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| docker-builds | ciflow/all, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-custom-ops | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-full-jit | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-arm64-metal | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64 | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-coreml | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| ios-12-5-1-x86-64-full-jit | ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda10.2-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk | 🚫 skipped |
| linux-docs-push | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| linux-xenial-cuda11.3-py3.7-gcc7-no-ops | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-arm64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-10-15-py3-lite-interpreter-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| macos-11-py3-x86-64 | ciflow/all, ciflow/macos, ciflow/trunk | 🚫 skipped |
| parallelnative-linux-xenial-py3.7-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |
| periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-bionic-cuda11.5-py3.7-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| periodic-win-vs2019-cuda11.5-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build | ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk | 🚫 skipped |

vadimkantorov (Contributor) commented Feb 11, 2022

Does this fix the related #39661? (except cublasLt fused matmul + bias)

swolchok added a commit that referenced this pull request Feb 11, 2022
Pull Request resolved: #72728

If the input is 3D and contiguous, we can get a fused addmm by reshaping.
ghstack-source-id: 148949323

Differential Revision: [D34176407](https://our.internmc.facebook.com/intern/diff/D34176407/)
swolchok added a commit that referenced this pull request Feb 19, 2022 (ghstack-source-id: 149546045)
swolchok added a commit that referenced this pull request Feb 22, 2022 (ghstack-source-id: 149656238)
swolchok added a commit that referenced this pull request Feb 24, 2022 (ghstack-source-id: 149879778)
swolchok added a commit that referenced this pull request Feb 25, 2022 (ghstack-source-id: 149980807)
swolchok added a commit that referenced this pull request Feb 26, 2022 (ghstack-source-id: 150032112)
swolchok added a commit that referenced this pull request Mar 1, 2022 (ghstack-source-id: 150236565)
swolchok added a commit that referenced this pull request Mar 7, 2022 (ghstack-source-id: 150736402)
swolchok added a commit that referenced this pull request Mar 8, 2022 (ghstack-source-id: 150813712)
swolchok added a commit that referenced this pull request Mar 25, 2022 (ghstack-source-id: 152278479)
facebook-github-bot pushed a commit that referenced this pull request Mar 28, 2022

Summary:
Pull Request resolved: #72728

If the input is 3D and contiguous, we can get a fused addmm by reshaping.
ghstack-source-id: 152278479

Test Plan: existing tests?

Reviewed By: zrphercule

Differential Revision: D34176407

fbshipit-source-id: 899f216cadcd782c3b1b046025228df04228c740
github-actions (Contributor) commented

Hey @swolchok.
You've committed this PR, but it does not have both a 'release notes: ...' and a 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc.) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc.).
For changes that are 'topic: not user facing' there is no need for a release notes label.

swolchok added the topic: performance (topic category) and release notes: cuda (release notes category) labels on Mar 29, 2022
facebook-github-bot deleted the gh/swolchok/458/head branch on April 1, 2022 14:17
pytorchmergebot pushed a commit that referenced this pull request Jan 21, 2023
…puts (#92201)

Fix for this issue surfaced from the discuss forum: https://discuss.pytorch.org/t/cuda-error-cublas-status-not-supported-when-calling-cublasltmatmul-from-torch-nn-functional-linear/170214

Note that PyTorch builds before #71200 should not be affected as there was no `cublasLt` dispatch path. Additionally, the provided repro has the quirk of using a 3D input, which means it will not dispatch to `cublasLt`-backed `addmm` until builds that include #72728. Changing the input to 2D by trivially removing the size `1` dimension will surface the failure on builds after #71200.

Interestingly, the case where _all_ inputs are 2-byte aligned is supported (it runs without crashing), but the case where some inputs are more than 2-byte aligned and some are exactly 2-byte aligned is not. This behavior suggests that the `cuBLASLt` heuristics are incorrect, as the heuristic function has visibility into the raw pointer values via the descriptors when it is called.
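
To make the alignment condition concrete, here is a hypothetical sketch (not the exact repro from the forum thread) of how an fp16 operand that is only 2-byte aligned can arise from slicing:

```python
import torch

# Hypothetical illustration (assumes a CUDA build; not the forum repro).
# CUDA allocations are strongly aligned, but a view at an odd element
# offset into fp16 storage shifts the data pointer by 2 bytes, leaving
# it only 2-byte aligned.
base = torch.randn(1024 * 1024 + 1, device="cuda", dtype=torch.half)
x = base[1:].view(1024, 1024)  # x.data_ptr() == base.data_ptr() + 2
print(base.data_ptr() % 16, x.data_ptr() % 16)  # typically 0 and 2

w = torch.randn(1024, 1024, device="cuda", dtype=torch.half)
b = torch.randn(1024, device="cuda", dtype=torch.half)
# On affected builds, mixing well-aligned and 2-byte-aligned operands
# could make the cublasLt-backed addmm fail with
# CUBLAS_STATUS_NOT_SUPPORTED.
y = torch.nn.functional.linear(x, w, b)
```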

We will follow up with `cuBLASLt`, but this fix is needed for now to prevent unnecessary crashes.
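
The fix amounts to checking operand alignment before selecting the cublasLt path. A rough Python-level illustration of such a guard follows (the real check lives in ATen's C++ dispatch, and the alignment value here is an assumption):

```python
def can_use_lt_path(*tensors, required_alignment=16):
    # Illustration only; the alignment requirement is an assumption and
    # the real check is implemented in ATen's C++ code. Take the fused
    # cublasLt path only if every operand's data pointer is sufficiently
    # aligned; otherwise fall back to the unfused matmul + bias add.
    return all(t.data_ptr() % required_alignment == 0 for t in tensors)
```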

CC @ptrblck @ngimel
Pull Request resolved: #92201
Approved by: https://github.com/ngimel