[quant][gpu][core] Added quantized linear operator in cudnn #73959
Conversation
Summary: This PR is similar to #70622, but for the linear operator. Unlike PR #70622, this implementation directly uses packed parameters rather than a refactorization, as was done for the conv operator, and also directly implements bias & relu. [ghstack-poisoned]
Please add a
@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```
// we need to add trailing dimensions in order to properly broadcast bias, otherwise broadcast_to will fail.
// the number of trailing dimensions is quantized_output.dim() - 2, so the new size of the broadcast_bias
// becomes quantized_output.dim() - 2 + 1. nothing needs to be done for the leading dimensions
std::vector<int64_t> new_size(quantized_output.dim() - 1, 1);
```
Is the call `ndim`? I feel maybe just matching the dimension is cleaner, i.e. create a `new_size` with the same number of dimensions as `quantized_output` and set `new_size[1]` to the expected dimension (see the sketch below).
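A minimal sketch of that alternative, only to illustrate the suggestion (the helper name and the assumption that the bias broadcasts along dimension 1 are mine, not from the PR):

```
#include <ATen/ATen.h>
#include <vector>

// Build a broadcastable view of the bias by matching quantized_output's rank:
// every entry of new_size is 1 except index 1, which holds the bias length.
// This avoids reasoning about how many trailing dimensions need to be appended.
at::Tensor broadcastable_bias(const at::Tensor& bias,
                              const at::Tensor& quantized_output) {
  std::vector<int64_t> new_size(quantized_output.dim(), 1);
  new_size[1] = bias.size(0);
  return bias.reshape(new_size);
}
```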
```
auto weight_fp = weight_transposed.int_repr().to(at::kFloat);
```

```
auto run = [&](cudnn_frontend::ManagedOpaqueDescriptor plan_desc) {
  auto workspace_size = 0;
```
I feel in general we have a lot of boilerplate code; maybe we can think about creating some helper functions or easier abstractions to make this simpler (a rough sketch of the idea is below). This will be helpful when we have more ops in cudnn.
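One possible shape for such a helper, sketched only to illustrate the suggestion; the function name and signature are hypothetical and would need to match how the existing ops stage their data pointers and uids:

```
#include <vector>
#include <ATen/ATen.h>
#include <ATen/cudnn/Exceptions.h>
#include <cudnn_frontend.h>

// Hypothetical helper that factors out the variant-pack construction and plan
// execution shared by the cudnn quantized ops. Callers would still build the
// operation graph and select an execution plan themselves.
void run_cudnn_plan(cudnnHandle_t handle,
                    const cudnn_frontend::ManagedOpaqueDescriptor& plan_desc,
                    std::vector<void*>& data_ptrs,
                    std::vector<int64_t>& uids,
                    at::Tensor& workspace) {
  auto variant_pack = cudnn_frontend::VariantPackBuilder()
      .setWorkspacePointer(workspace.data_ptr())
      .setDataPointers(static_cast<int64_t>(data_ptrs.size()), data_ptrs.data())
      .setUids(static_cast<int64_t>(uids.size()), uids.data())
      .build();
  AT_CUDNN_CHECK(cudnnBackendExecute(
      handle, plan_desc->get_backend_descriptor(), variant_pack.get_raw_desc()));
}
```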
```
// .setbMatDesc(cudnn_utils::getTensorDescriptor(orig_weight.sizes(), orig_weight.strides(), CUDNN_DATA_FLOAT, 'w', key.weight_alignment))
.setbMatDesc(cudnn_utils::getTensorDescriptor(weight_fp.sizes(), weight_fp.strides(), CUDNN_DATA_FLOAT, 'w', key.weight_alignment))
.setcMatDesc(cudnn_utils::getTensorDescriptor(linear_output, 'y', key.output_alignment))
.setmatmulDesc(getLinearDescriptor(CUDNN_DATA_FLOAT)) // is this right? should it be float?
```
I remember we have a table for the descriptor data types; maybe we can implement that as a function that gets the descriptor data type from the input data type (see the sketch below).
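Something along these lines could replace the hard-coded constants; the function name is hypothetical and the set of supported dtypes is only illustrative:

```
#include <ATen/ATen.h>
#include <cudnn.h>

// Hypothetical mapping from an input tensor's dtype to the cudnn descriptor
// data type, so call sites stop hard-coding CUDNN_DATA_FLOAT.
cudnnDataType_t getCudnnDataType(const at::Tensor& t) {
  switch (t.scalar_type()) {
    case at::kFloat:
      return CUDNN_DATA_FLOAT;
    case at::kHalf:
      return CUDNN_DATA_HALF;
    case at::kQInt8:
    case at::kChar:
      return CUDNN_DATA_INT8;
    case at::kQInt32:
    case at::kInt:
      return CUDNN_DATA_INT32;
    default:
      TORCH_CHECK(false, "unsupported dtype for cudnn tensor descriptor");
  }
}
```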
jerryzh168 left a comment
Looks good, had some nit comments inline
Summary: Pull Request resolved: #73959. This PR is similar to #70622, but for the linear operator. Unlike PR #70622, this implementation directly uses packed parameters rather than a refactorization, as was done for the conv operator, and also directly implements bias & relu. Currently, int8 matrix multiplication is not supported in cudnn. The ETA for this support is the first half of April 2022. As a temporary workaround, we cast our int8 tensors to fp32 prior to matmul.
Test Plan:
```
python test/test_quantization.py TestQuantizedLinear.test_qlinear_cudnn
```
Imported from OSS
Differential Revision: D34824251
Reviewed By: jerryzh168
Pulled By: dzdang
fbshipit-source-id: 47139796782ade8d030ba2f9968a9abdd3a91d2f
Stack from ghstack (oldest at bottom):
Summary:
This PR is similar to #70622, but for the linear operator.
Unlike PR #70622, this implementation directly uses packed parameters rather than a refactorization, as was done for the conv operator,
and also directly implements bias & relu.
Currently, int8 matrix multiplication is not supported in cudnn. The ETA for this support is in the first half of April 2022. As
a temporary workaround, we cast our int8 tensors to fp32 prior to matmul.
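The numerics of that workaround look roughly like the sketch below (written against the public ATen API for illustration; the helper name, the qint8 output dtype, and the explicit dequantize/requantize steps are assumptions, not the exact graph this PR builds in cudnn):

```
#include <ATen/ATen.h>

// Illustrative stand-in for the missing int8 matmul: dequantize the int8
// activation and weight to fp32, run the fp32 linear, then requantize the
// result to the requested output scale/zero point.
at::Tensor qlinear_via_fp32(const at::Tensor& act,     // quantized activation
                            const at::Tensor& weight,  // quantized weight
                            const at::Tensor& bias,    // fp32 bias
                            double output_scale,
                            int64_t output_zero_point) {
  auto act_fp = act.dequantize();
  auto weight_fp = weight.dequantize();
  auto out_fp = at::linear(act_fp, weight_fp, bias);
  return at::quantize_per_tensor(out_fp, output_scale, output_zero_point, at::kQInt8);
}
```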
Test plan:
Differential Revision: D34824251