
Conversation


@dzdang dzdang commented Mar 9, 2022

Stack from ghstack (oldest at bottom):

Summary:
This PR is similar to #70622, but for the linear operator.
Unlike #70622, this implementation uses packed parameters directly, rather than the refactoring that was done for the conv operator, and it also implements bias & relu directly.
Currently, int8 matrix multiplication is not supported in cuDNN. Support is expected in the first half of April 2022; as a temporary workaround, we cast our int8 tensors to fp32 prior to the matmul.

Test plan:

```
python test/test_quantization.py TestQuantizedLinear.test_qlinear_cudnn
```

Differential Revision: D34824251
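
A minimal ATen sketch of the casting workaround described above (illustrative only; the tensor names, shapes, and quantization parameters are assumptions, and the PR's real code path goes through cudnn_frontend rather than at::matmul):

```
#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Hypothetical per-tensor-quantized int8 activation [4, 8] and weight [16, 8].
  auto act = at::quantize_per_tensor(at::randn({4, 8}), /*scale=*/0.1, /*zero_point=*/0, at::kQInt8);
  auto w   = at::quantize_per_tensor(at::randn({16, 8}), /*scale=*/0.05, /*zero_point=*/0, at::kQInt8);
  // cuDNN has no int8 matmul yet, so cast the int8 representations to fp32
  // and multiply in floating point; the result is requantized afterwards.
  auto act_fp = act.int_repr().to(at::kFloat);
  auto w_fp   = w.int_repr().to(at::kFloat);
  auto out = at::matmul(act_fp, w_fp.t());  // [4, 16]
  std::cout << out.sizes() << "\n";
  return 0;
}
```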

@pytorch-bot

pytorch-bot bot commented Mar 9, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/1d1d4eb0647ae79cdbe7bcfa9d066703a875b31d/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
linux-binary-manywheel ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build ciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
macos-arm64-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-arm64-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
macos-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
windows-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
windows-binary-libtorch-debug ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
windows-binary-libtorch-release ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
windows-binary-wheel ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla 🚫 skipped

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 9, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 91511fd (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

dzdang added a commit that referenced this pull request Mar 9, 2022

ghstack-source-id: edd733d
Pull Request resolved: #73959
@dzdang dzdang marked this pull request as draft March 9, 2022 18:34
dzdang added a commit that referenced this pull request Mar 10, 2022

ghstack-source-id: b8f25a2
Pull Request resolved: #73959
dzdang added a commit that referenced this pull request Mar 11, 2022

ghstack-source-id: 4423fcf
Pull Request resolved: #73959
dzdang added a commit that referenced this pull request Mar 11, 2022

ghstack-source-id: 5655128
Pull Request resolved: #73959
@jerryzh168
Contributor

Please add a Test Plan for this PR as well

dzdang added a commit that referenced this pull request Mar 11, 2022

ghstack-source-id: 3bba699
Pull Request resolved: #73959
@dzdang
Contributor Author

dzdang commented Mar 11, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dzdang
Contributor Author

dzdang commented Mar 29, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

dzdang added a commit that referenced this pull request Mar 29, 2022

ghstack-source-id: 2e4a852
Pull Request resolved: #73959
@dzdang
Contributor Author

dzdang commented Mar 29, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dzdang dzdang requested a review from jerryzh168 March 29, 2022 21:41
```
// We need to add trailing dimensions in order to properly broadcast bias; otherwise broadcast_to will fail.
// The number of trailing dimensions is quantized_output.dim() - 2, so the new size of broadcast_bias
// becomes (quantized_output.dim() - 2) + 1. Nothing needs to be done for the leading dimensions.
std::vector<int64_t> new_size(quantized_output.dim() - 1, 1);
```

@jerryzh168 jerryzh168 Mar 30, 2022


  1. Is the call ndim?
  2. I feel it may be cleaner to just match the dimension, i.e. create a new_size with the same number of dimensions as quantized_output and set new_size[1] to the expected dimension (see the sketch below).
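
To make the two options concrete, here is a small standalone sketch (hypothetical shapes and variable names; the PR's actual tensors differ) of adding trailing singleton dimensions versus matching the output's dimensionality outright:

```
#include <ATen/ATen.h>
#include <iostream>
#include <vector>

int main() {
  auto output = at::zeros({2, 8, 5, 5});  // stand-in for quantized_output
  auto bias = at::ones({8});
  // PR approach: (output.dim() - 2) trailing ones after the bias dim, giving a
  // view with output.dim() - 1 entries; the leading dim broadcasts implicitly.
  std::vector<int64_t> new_size(output.dim() - 1, 1);
  new_size[0] = bias.size(0);
  auto b1 = bias.reshape(new_size).broadcast_to({8, 5, 5});  // [8, 1, 1] -> [8, 5, 5]
  // Reviewer's alternative: match output.dim() and set index 1 to the bias length.
  std::vector<int64_t> alt(output.dim(), 1);
  alt[1] = bias.size(0);
  auto b2 = bias.reshape(alt);  // [1, 8, 1, 1] broadcasts against output directly
  std::cout << (output + b1).sizes() << " " << (output + b2).sizes() << "\n";
  return 0;
}
```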

```
auto weight_fp = weight_transposed.int_repr().to(at::kFloat);

auto run = [&](cudnn_frontend::ManagedOpaqueDescriptor plan_desc) {
  auto workspace_size = 0;
```

I feel that in general we have a lot of boilerplate code; maybe we can think about creating some helper functions or easier abstractions to make this simpler. That will be helpful when we have more ops in cudnn.
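
One possible shape for such a helper, sketched under the assumption of the cudnn_frontend variant-pack API used elsewhere in ATen's cuDNN code (the function name and signature here are hypothetical, not the PR's code):

```
#include <cudnn_frontend.h>
#include <ATen/ATen.h>
#include <ATen/cudnn/Exceptions.h>
#include <algorithm>
#include <vector>

// Execute a finalized cudnn_frontend plan with a freshly allocated workspace;
// data_ptrs/uids describe the plan's input and output tensors.
void run_plan(cudnnHandle_t handle,
              const cudnn_frontend::ManagedOpaqueDescriptor& plan_desc,
              int64_t workspace_size,
              std::vector<void*>& data_ptrs,
              std::vector<int64_t>& uids) {
  auto workspace = at::empty({std::max<int64_t>(workspace_size, 1)},
                             at::TensorOptions().device(at::kCUDA).dtype(at::kByte));
  auto variant_pack = cudnn_frontend::VariantPackBuilder()
                          .setWorkspacePointer(workspace.data_ptr())
                          .setDataPointers(static_cast<int64_t>(data_ptrs.size()), data_ptrs.data())
                          .setUids(static_cast<int64_t>(uids.size()), uids.data())
                          .build();
  AT_CUDNN_CHECK(cudnnBackendExecute(handle, plan_desc->get_backend_descriptor(),
                                     variant_pack.get_raw_desc()));
}
```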

```
// .setbMatDesc(cudnn_utils::getTensorDescriptor(orig_weight.sizes(), orig_weight.strides(), CUDNN_DATA_FLOAT, 'w', key.weight_alignment))
.setbMatDesc(cudnn_utils::getTensorDescriptor(weight_fp.sizes(), weight_fp.strides(), CUDNN_DATA_FLOAT, 'w', key.weight_alignment))
.setcMatDesc(cudnn_utils::getTensorDescriptor(linear_output, 'y', key.output_alignment))
.setmatmulDesc(getLinearDescriptor(CUDNN_DATA_FLOAT)) // is this right? should it be float?
```

I remember we have a table for the descriptor data types; maybe we can implement that as a function that gets the descriptor data type from the input data type.
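
A sketch of that lookup as a function (the mapping below is an assumption based on common ATen-to-cuDNN type pairs; the PR's actual table may differ):

```
#include <cudnn.h>
#include <c10/core/ScalarType.h>
#include <c10/util/Exception.h>

// Map an ATen scalar type to the cudnn data type used when building
// tensor descriptors.
cudnnDataType_t getCudnnDataType(at::ScalarType t) {
  switch (t) {
    case at::ScalarType::Float: return CUDNN_DATA_FLOAT;
    case at::ScalarType::Half:  return CUDNN_DATA_HALF;
    case at::ScalarType::QInt8: // quantized int8 is plain int8 on the cuDNN side
    case at::ScalarType::Char:  return CUDNN_DATA_INT8;
    case at::ScalarType::Int:   return CUDNN_DATA_INT32;
    default:
      TORCH_CHECK(false, "unsupported dtype for cudnn tensor descriptor");
  }
}
```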


@jerryzh168 jerryzh168 left a comment


Looks good, had some nit comments inline

dzdang added a commit that referenced this pull request Mar 31, 2022

ghstack-source-id: 586d55e
Pull Request resolved: #73959
@dzdang
Contributor Author

dzdang commented Mar 31, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dzdang
Contributor Author

dzdang commented Mar 31, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dzdang
Contributor Author

dzdang commented Mar 31, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

dzdang added a commit that referenced this pull request Mar 31, 2022

ghstack-source-id: 0a2c6c1
Pull Request resolved: #73959
@dzdang
Contributor Author

dzdang commented Mar 31, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Apr 1, 2022
Summary:
Pull Request resolved: #73959

Imported from OSS

Differential Revision: D34824251

Reviewed By: jerryzh168

Pulled By: dzdang

fbshipit-source-id: 47139796782ade8d030ba2f9968a9abdd3a91d2f
@facebook-github-bot facebook-github-bot deleted the gh/dzdang/51/head branch April 5, 2022 14:17