Add ONEDNN quantization backend #69820
Conversation
⚛️ CI Flow Status
You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun
# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
For more information, please take a look at the CI Flow Wiki.
Force-pushed from 5a30331 to 6c45802
💊 CI failures summary and remediations
As of commit 8a40b8c (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚
This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
Force-pushed from ebf6c82 to 0543f13
Hi @jerryzh168 @VitalyFedyunin, please review this PR. Thanks.

The failure does not seem to be caused by this patch.
Force-pushed from ac072a4 to 07777f7
…e=True` in qconfig for unit test TestQuantizedOps.test_custom_module_multi_head_attention. Skip unsupported tests (output padding for deconv)
Force-pushed from 07777f7 to 8bfa256
Now all checks have passed. Please review. Thanks.
```python
# ONEDNN only supports symmetric quantization of weight
if torch.backends.quantized.engine == 'onednn':
    W_q = torch.quantize_per_tensor(W, 0.1, 0, torch.qint8)
```
would L100 error out since it's not symmetric quantization?
maybe we can select scale/zero_point based on qengine instead of hardcoding them here
I think L100 is OK.

> maybe we can select scale/zero_point based on qengine instead of hardcoding them here

Do you mean something like this?
`zp_weight = 0 if qengine_is_onednn() else torch.randint(1, 10, (1,)).item()`
yes exactly
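A minimal sketch of that suggestion, picking the weight zero point from the active engine instead of hardcoding it (`weight_qparams` is a hypothetical helper for illustration, not part of the test suite):

```python
# Sketch only: onednn requires symmetric weight quantization (zero_point == 0),
# while fbgemm/qnnpack also accept non-zero weight zero points.
import torch

def weight_qparams():  # hypothetical helper, not the actual test code
    scale = 0.1
    if torch.backends.quantized.engine == 'onednn':
        zero_point = 0
    else:
        zero_point = int(torch.randint(1, 10, (1,)).item())
    return scale, zero_point

W = torch.rand(4, 8)
scale, zero_point = weight_qparams()
W_q = torch.quantize_per_tensor(W, scale, zero_point, torch.qint8)
```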
```python
             X_scale, X_zero_point, W_scale, W_zero_point, Y_scale, Y_zero_point,
             use_bias, use_fused, use_channelwise):
    # ONEDNN only supports symmetric quantization of weight
    if torch.backends.quantized.engine == 'onednn' and not all(zp == 0 for zp in W_zero_point):
```
if we select scale/zero_point based on qengine we won't need this check
Do you mean to select the proper weight scale/zp with an `if ... else ...` in each unit test?
Yes, this function is called from https://github.com/pytorch/pytorch/blob/master/test/quantization/core/test_quantized_module.py#L387, so we can generate the scale/zp based on the engine. The current implementation will just skip the check, I think.
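A possible shape for that, sketched against the quoted signature (illustrative only; the surrounding test harness is omitted and the variable names simply follow the hunk above):

```python
# Illustrative only: generate per-channel weight zero points at the call site
# based on the active engine, so the shared helper no longer needs to skip the check.
import torch

out_channels = 4
if torch.backends.quantized.engine == 'onednn':
    W_zero_point = [0] * out_channels                      # symmetric weights only
else:
    W_zero_point = torch.randint(-5, 5, (out_channels,)).tolist()
```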
```python
W = torch.rand(out_features, in_features).float()
W_scale, W_zp = _calculate_dynamic_qparams(W, torch.qint8)
# ONEDNN only supports symmetric quantization of weight
if torch.backends.quantized.engine == 'onednn' and W_zp != 0:
```
same here
Here, weight scale and zero point are calculated, not selected manually. Do you mean we need a new function to calculate weight scale/zero point for symmetric quantization?
Yes, I think so. We can add a qscheme argument to https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_quantized.py#L49 to support symmetric quantization, and set the qscheme to symmetric quantization when the qengine is mkldnn/onednn.
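A rough sketch of what such a qscheme-aware helper could look like (hypothetical function, not the actual `_calculate_dynamic_qparams` signature): for a symmetric qscheme the zero point is pinned to 0 and the scale comes from max(|x|).

```python
# Hypothetical sketch, not the real helper: qparams that respect the qscheme.
import torch

def calculate_qparams(x, qscheme=torch.per_tensor_affine):
    qmin, qmax = -128, 127                      # torch.qint8 range
    if qscheme == torch.per_tensor_symmetric:
        scale = x.abs().max().item() / qmax
        zero_point = 0
    else:
        x_min, x_max = x.min().item(), x.max().item()
        scale = (x_max - x_min) / (qmax - qmin)
        zero_point = int(round(qmin - x_min / scale)) if scale > 0 else 0
    return max(scale, 1e-8), zero_point

# Symmetric qparams would be selected when the engine is mkldnn/onednn:
qscheme = (torch.per_tensor_symmetric
           if torch.backends.quantized.engine == 'onednn'
           else torch.per_tensor_affine)
scale, zp = calculate_qparams(torch.randn(8, 8), qscheme=qscheme)
```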
```python
# ONEDNN only supports symmetric quantization of weight
if torch.backends.quantized.engine == 'onednn':
    W_zps = np.zeros(output_channels).astype(np.int)
```
nit: we can put L3211 in the else branch
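For clarity, the nit amounts to something like the following; the else branch here is only a stand-in for the line the reviewer refers to, not the actual test code:

```python
# Sketch of the reviewer's nit: compute W_zps once in an if/else instead of
# overwriting it afterwards. The else branch is illustrative only.
import numpy as np
import torch

output_channels = 8
if torch.backends.quantized.engine == 'onednn':
    # ONEDNN only supports symmetric quantization of weight
    W_zps = np.zeros(output_channels).astype(int)
else:
    W_zps = np.random.randint(-5, 5, size=output_channels).astype(int)
```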
…ackend only supports symmetric quantization of weight
I did a rebase and it looks like there are still errors. I think the third-party import is probably still not done yet.
Hi @jerryzh168, thanks for the update. Could you please remind @frank-wei to import? Thanks.
Hi @Xia-Weiwen, can you resolve the merge conflict? @frank-wei just finished the update of the ideep library, so we can import the PR now.
Hi @jerryzh168, conflict resolved. Please move on. Thanks.
@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @Xia-Weiwen, I have imported it again, and it looks like there are some other errors. There are other lint warnings as well, and I'm not sure what the best way to communicate them is, but I feel it might be OK to fix them later.
Hi @jerryzh168, thanks for the update. Are they all warnings about unused variables? Do you think it's better to fix them now, or later in another PR? Anyway, could you please provide a log of these warnings so I can fix them?
Hi @Xia-Weiwen, unfortunately there is no easy way to export those warnings right now. Can you just fix the blocking ones for now, i.e. the one I pasted in the previous comment? I think we can leave the rest for now. We're also moving to GitHub First soon, so hopefully these problems can be addressed during that move.
@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @Xia-Weiwen, I can confirm there are no more internal errors now, but it looks like there is a new merge conflict; can you help resolve it? I think we should be able to land after that.
@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary:
This PR adds a new quantization backend, ONEDNN, with quantized conv and linear kernels in the same code path as the FBGEMM backend.

The ONEDNN backend is an alternative to the FBGEMM and QNNPACK backends. It takes advantage of features of the latest Intel® CPU products. It supports VNNI on Cascade Lake and the AMX instruction set, which will be available on Sapphire Rapids and offers 8X the int8 peak TOPS of VNNI.

ONEDNN demonstrates better performance than FBGEMM on conv kernels of popular CNN models. It also supports more fused ops, such as convolution-add-ReLU, than FBGEMM and QNNPACK.

To use this backend, users only need to set the quantization backend to 'onednn' before any calculation, without any change to their models.

```python
torch.backends.quantized.engine = 'onednn'
```

## Design docs
#21120 (comment)
#67177 (comment)

## File changes
**Add ONEDNN to qengine list**
- aten/src/ATen/Context.cpp
- c10/core/QEngine.h
- torch/ao/quantization/qconfig.py
- torch/backends/quantized/\_\_init\_\_.py

**Implement qconv & qlinear for ONEDNN backend**
- aten/src/ATen/native/quantized/cpu/conv_serialization.h
- aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp
- aten/src/ATen/native/quantized/cpu/onednn_utils.h
- aten/src/ATen/native/quantized/cpu/qconv.cpp
- aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp
- aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp
- aten/src/ATen/native/quantized/cpu/qconv_unpack.cpp
- aten/src/ATen/native/quantized/cpu/qlinear.cpp
- aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp
- aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp
- aten/src/ATen/native/quantized/cpu/qlinear_unpack.cpp

**Skip tests that are not supported by ONEDNN**
- test/ao/sparsity/test_kernels.py
- test/quantization/core/test_quantized_module.py
- test/quantization/core/test_quantized_op.py

## Validation results
This PR has passed `test_quantization.py` and `test_mkldnn.py`. Below are performance data of int8 2d convolution and linear on the Cascade Lake Xeon® platform. (Note: tested with a single instance on a single core, using the latest oneDNN library.)

**Table 1. Performance comparison of int8 2d convolution operator**

| No. | Shape | FBGEMM | ONEDNN | Gain |
|-|-|-|-|-|
| 1 | IC=128, OC=128, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0 | 668.310us | 535.630us | 24.8% |
| 2 | IC=128, OC=128, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0 | 290.630us | 281.810us | 3.1% |
| 3 | IC=128, OC=256, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0 | 1.045ms | 893.010us | 17.0% |
| 4 | IC=128, OC=256, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0 | 385.320us | 373.720us | 3.1% |
| 5 | IC=256, OC=256, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0 | 1.876ms | 1.641ms | 14.3% |
| 6 | IC=256, OC=256, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0 | 660.460us | 638.470us | 3.4% |

**Table 2. Performance comparison of int8 linear operator**

| No. | Shape (m, n, k) | FBGEMM | ONEDNN | Gap |
|-|-|-|-|-|
| 1 | 64, 800, 320 | 80.550us | 96.770us | 20.10% |
| 2 | 64, 768, 512 | 101.230us | 130.720us | 29.10% |
| 3 | 16, 256, 512 | 30.230us | 51.450us | 70.20% |
| 4 | 128, 128, 128 | 33.810us | 50.480us | 49.30% |
| 5 | 256, 512, 256 | 154.490us | 195.050us | 26.30% |
| 6 | 1024, 1024, 1024 | 3.134ms | 3.514ms | 12.10% |

ONEDNN showed advantages over FBGEMM for convolution. However, it has a performance gap relative to FBGEMM for linear ops. The gap is a known issue, and further optimization is in progress in the oneDNN library. On the latest platforms, better performance of ONEDNN is achieved for both conv and linear.
Pull Request resolved: #69820
Reviewed By: HDCharles
Differential Revision: D33716039
Pulled By: jerryzh168
fbshipit-source-id: 6f7bb807e85798142dfcffccfca8b8bd652fb3dd
Hey @Xia-Weiwen. |
```cpp
#include <ATen/Config.h>
#if AT_MKLDNN_ENABLED()
#include <ATen/Tensor.h>
#include <ATen/native/quantized/cpu/conv_packed_params.h>
```
Hi @Xia-Weiwen, I landed the PR, but it looks like this line is not up to date; we should remove this line. I'm reverting the change right now. Can you help recreate the PR after this is reverted?
Hi @jerryzh168, I created a new PR #74137. Please take a look. Thanks.
This pull request has been reverted by 5a89753. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).
Summary
This PR adds a new quantization backend, ONEDNN, with quantized conv and linear kernels in the same code path as the FBGEMM backend.
The ONEDNN backend is an alternative to the FBGEMM and QNNPACK backends. It takes advantage of features of the latest Intel® CPU products. It supports VNNI on Cascade Lake and the AMX instruction set, which will be available on Sapphire Rapids and offers 8X the int8 peak TOPS of VNNI.
ONEDNN demonstrates better performance than FBGEMM on conv kernels of popular CNN models. It also supports more fused ops, such as convolution-add-ReLU, than FBGEMM and QNNPACK.
To use this backend, users only need to set the quantization backend to 'onednn' before any calculation, without any change to their models.
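For example (the same one-liner shown in the commit summary above):

```python
import torch

torch.backends.quantized.engine = 'onednn'
```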
Design docs
#21120 (comment)
#67177 (comment)
File changes
Add ONEDNN to qengine list
Implement qconv & qlinear for ONEDNN backend
Skip tests that are not supported by ONEDNN
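As a hedged illustration of what such a skip can look like in a test, the class and test names below are made up; the actual skip conditions live in the test files listed in this section:

```python
# Illustrative only: skip a case the onednn backend does not support
# (e.g. output padding for deconv); names here are hypothetical.
import unittest
import torch

class TestQDeconvOneDNN(unittest.TestCase):
    def test_output_padding(self):
        if torch.backends.quantized.engine == 'onednn':
            self.skipTest("output padding for deconv is not supported by onednn")
        # ... the actual deconv checks would go here ...

if __name__ == "__main__":
    unittest.main()
```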
Validation results
This PR has passed `test_quantization.py` and `test_mkldnn.py`. Below are performance data of int8 2d convolution and linear on the Cascade Lake Xeon® platform.
(Note: tested with a single instance on a single core, using the latest oneDNN library.)
Table 1. Performance comparison of int8 2d convolution operator

| No. | Shape | FBGEMM | ONEDNN | Gain |
|-|-|-|-|-|
| 1 | IC=128, OC=128, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0 | 668.310us | 535.630us | 24.8% |
| 2 | IC=128, OC=128, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0 | 290.630us | 281.810us | 3.1% |
| 3 | IC=128, OC=256, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0 | 1.045ms | 893.010us | 17.0% |
| 4 | IC=128, OC=256, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0 | 385.320us | 373.720us | 3.1% |
| 5 | IC=256, OC=256, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0 | 1.876ms | 1.641ms | 14.3% |
| 6 | IC=256, OC=256, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0 | 660.460us | 638.470us | 3.4% |

Table 2. Performance comparison of int8 linear operator

| No. | Shape (m, n, k) | FBGEMM | ONEDNN | Gap |
|-|-|-|-|-|
| 1 | 64, 800, 320 | 80.550us | 96.770us | 20.10% |
| 2 | 64, 768, 512 | 101.230us | 130.720us | 29.10% |
| 3 | 16, 256, 512 | 30.230us | 51.450us | 70.20% |
| 4 | 128, 128, 128 | 33.810us | 50.480us | 49.30% |
| 5 | 256, 512, 256 | 154.490us | 195.050us | 26.30% |
| 6 | 1024, 1024, 1024 | 3.134ms | 3.514ms | 12.10% |
ONEDNN showed advantages over FBGEMM for convolution. However, it has a performance gap relative to FBGEMM for linear ops. The gap is a known issue, and further optimization is in progress in the oneDNN library. On the latest platforms, better performance of ONEDNN is achieved for both conv and linear.