
Conversation

@Xia-Weiwen Xia-Weiwen (Collaborator) commented Dec 13, 2021

Summary

This PR adds a new quantization backend, ONEDNN, with quantized conv and linear kernels in the same code path as the FBGEMM backend.

The ONEDNN backend is an alternative to the FBGEMM and QNNPACK backends. It takes advantage of features of the latest Intel® CPU products: VNNI on Cascade Lake and the AMX instruction set, which will be available on Sapphire Rapids and offers 8X the int8 peak TOPS of VNNI.

ONEDNN demonstrates better performance than FBGEMM on the conv kernels of popular CNN models. It also supports more fused ops, such as convolution-add-ReLU, than FBGEMM and QNNPACK.
To use this backend, users only need to set the quantization backend to 'onednn' before any computation; no changes to their models are required:

torch.backends.quantized.engine = 'onednn'
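
A minimal end-to-end sketch of eager-mode post-training static quantization with this backend follows. The toy module, calibration input, and the 'onednn' default qconfig name are illustrative assumptions (and assume a PyTorch build in which the 'onednn' engine is available); this is not code from the PR itself.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (DeQuantStub, QuantStub, convert,
                                   get_default_qconfig, prepare)

torch.backends.quantized.engine = 'onednn'  # select the ONEDNN backend

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()        # float -> quint8 at the model entry
        self.conv = nn.Conv2d(3, 8, 3)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()    # quint8 -> float at the model exit

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

m = M().eval()
m.qconfig = get_default_qconfig('onednn')   # assumes an 'onednn' default qconfig
prepared = prepare(m)                       # insert observers
prepared(torch.randn(4, 3, 32, 32))         # calibrate with representative data
quantized = convert(prepared)               # swap in quantized modules
out = quantized(torch.randn(4, 3, 32, 32))  # runs int8 conv via the ONEDNN engine
```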

Design docs

#21120 (comment)
#67177 (comment)

File changes

Add ONEDNN to qengine list

  • aten/src/ATen/Context.cpp
  • c10/core/QEngine.h
  • torch/ao/quantization/qconfig.py
  • torch/backends/quantized/__init__.py

Implement qconv & qlinear for ONEDNN backend

  • aten/src/ATen/native/quantized/cpu/conv_serialization.h
  • aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp
  • aten/src/ATen/native/quantized/cpu/onednn_utils.h
  • aten/src/ATen/native/quantized/cpu/qconv.cpp
  • aten/src/ATen/native/quantized/cpu/qconv_dynamic.cpp
  • aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp
  • aten/src/ATen/native/quantized/cpu/qconv_unpack.cpp
  • aten/src/ATen/native/quantized/cpu/qlinear.cpp
  • aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp
  • aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp
  • aten/src/ATen/native/quantized/cpu/qlinear_unpack.cpp

Skip tests that are not supported by ONEDNN

  • test/ao/sparsity/test_kernels.py
  • test/quantization/core/test_quantized_module.py
  • test/quantization/core/test_quantized_op.py

Validation results

This PR has passed test_quantization.py and test_mkldnn.py.
Below are performance data for int8 2d convolution and linear operators on a Cascade Lake Xeon® platform.
(Note: tested with a single instance on a single core, using the latest oneDNN library.)

Table 1. Performance comparison of int8 2d convolution operator

| No. | Shape | FBGEMM | ONEDNN | Gain |
|-|-|-|-|-|
| 1 | IC=128, OC=128, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0 | 668.310us | 535.630us | 24.8% |
| 2 | IC=128, OC=128, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0 | 290.630us | 281.810us | 3.1% |
| 3 | IC=128, OC=256, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0 | 1.045ms | 893.010us | 17.0% |
| 4 | IC=128, OC=256, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0 | 385.320us | 373.720us | 3.1% |
| 5 | IC=256, OC=256, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0 | 1.876ms | 1.641ms | 14.3% |
| 6 | IC=256, OC=256, kernel=3, stride=2, N=4, H=32, W=32, G=1, pad=0 | 660.460us | 638.470us | 3.4% |

Table 2. Performance comparison of int8 linear operator

| No. | Shape (m, n, k) | FBGEMM | ONEDNN | Gap |
|-|-|-|-|-|
| 1 | 64, 800, 320 | 80.550us | 96.770us | 20.10% |
| 2 | 64, 768, 512 | 101.230us | 130.720us | 29.10% |
| 3 | 16, 256, 512 | 30.230us | 51.450us | 70.20% |
| 4 | 128, 128, 128 | 33.810us | 50.480us | 49.30% |
| 5 | 256, 512, 256 | 154.490us | 195.050us | 26.30% |
| 6 | 1024, 1024, 1024 | 3.134ms | 3.514ms | 12.10% |

ONEDNN showed advantages over FBGEMM for convolution. However, it still has a performance gap to FBGEMM for linear ops. The gap is a known issue, and further optimization is in progress in the oneDNN library. On the latest platforms, ONEDNN achieves better performance for both conv and linear.
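
For context, a single-op measurement like row 1 of Table 1 can be approximated with a micro-benchmark along the following lines. This is a rough sketch, not the script used to produce the numbers above; the module construction with default (uncalibrated) weights, the input qparams, and the timing setup are assumptions.

```python
import torch
import torch.nn.quantized as nnq
from torch.utils import benchmark

torch.set_num_threads(1)                    # single instance on a single core
torch.backends.quantized.engine = 'onednn'  # switch to 'fbgemm' to compare

# Shape No. 1: IC=128, OC=128, kernel=3, stride=1, N=4, H=32, W=32, G=1, pad=0
x = torch.randn(4, 128, 32, 32)
qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=64, dtype=torch.quint8)
qconv = nnq.Conv2d(128, 128, kernel_size=3, stride=1, padding=0)

timer = benchmark.Timer(stmt='qconv(qx)', globals={'qconv': qconv, 'qx': qx})
print(timer.timeit(200))
```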

@pytorch-probot pytorch-probot bot commented Dec 13, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/Xia-Weiwen/pytorch/blob/8b3cfd2afba2c7d936f37c8e38ffb7f38f66970a/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries/conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries/libtorch, ciflow/default ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries/libtorch, ciflow/default ✅ triggered
linux-binary-manywheel ciflow/binaries, ciflow/binaries/wheel, ciflow/default ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk, ciflow/xla ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/linux, ciflow/rocm, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot facebook-github-bot (Contributor) commented Dec 14, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 8a40b8c (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@Xia-Weiwen Xia-Weiwen force-pushed the onednn_quant_backend branch 4 times, most recently from ebf6c82 to 0543f13, on December 15, 2021 06:08
@XiaobingSuper XiaobingSuper added the "intel priority" label (matters to Intel architecture from a performance perspective) on Dec 15, 2021
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review December 16, 2021 08:59
@Xia-Weiwen (Collaborator Author)

Hi @jerryzh168 @VitalyFedyunin please review this PR. Thanks.

@Xia-Weiwen (Collaborator Author)

The failure does not seem to be caused by this patch:

19:07:53   test_fn_fwgrad_bwgrad_gradient_cuda_complex128 (__main__.TestGradientsCUDA) ... Memory exception on virtual address 0x7f46c627a000, node id 4 : Page not present
19:07:53 Address does not belong to a known buffer
19:07:53 Memory access fault by GPU node-4 (Agent handle: 0x55a9bc01d060) on address 0x7f46c627a000. Reason: Page not present or supervisor privilege.
19:08:00 Traceback (most recent call last):
19:08:00   File "test/run_test.py", line 1068, in <module>
19:08:00     main()
19:08:00   File "test/run_test.py", line 1046, in main
19:08:00     raise RuntimeError(err_message)
19:08:00 RuntimeError: test_ops failed! Received signal: SIGIOT

@mruberry mruberry requested a review from vkuzo December 17, 2021 14:24
@mruberry mruberry added the "triaged" label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Dec 17, 2021
@vkuzo vkuzo requested a review from jerryzh168 December 17, 2021 14:36
…e=True` in qconfig for unit test TestQuantizedOps.test_custom_module_multi_head_attention. Skip unsupported tests (output padding for deconv)
@Xia-Weiwen Xia-Weiwen force-pushed the onednn_quant_backend branch from 07777f7 to 8bfa256 on January 5, 2022 01:58
@Xia-Weiwen (Collaborator Author)

Now all checks have passed. Please review. Thanks.

Comment on lines 101 to 103
# ONEDNN only supports symmetric quantization of weight
if torch.backends.quantized.engine == 'onednn':
    W_q = torch.quantize_per_tensor(W, 0.1, 0, torch.qint8)
Contributor

would L100 error out since it's not symmetric quantization?

Contributor

maybe we can select scale/zero_point based on qengine instead of hardcode them here

Collaborator Author

I think L100 is OK.

maybe we can select scale/zero_point based on qengine instead of hardcode them here

Do you mean something like this?
zp_weight = 0 if qengine_is_onednn() else torch.randint(1, 10, (1,)).item()

Contributor

yes exactly
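
A hedged sketch of the agreed-upon pattern, as it might look in a test body. The qengine_is_onednn() helper and the concrete shapes/qparams here are assumptions for illustration, not the final test code.

```python
import torch
from torch.testing._internal.common_quantized import qengine_is_onednn  # assumed helper

# ONEDNN only supports symmetric weight quantization, so pick the weight
# zero point per engine instead of hard-coding it.
W = torch.randn(64, 32)
W_scale = 0.1
W_zero_point = 0 if qengine_is_onednn() else torch.randint(1, 10, (1,)).item()
W_q = torch.quantize_per_tensor(W, W_scale, W_zero_point, torch.qint8)
```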

X_scale, X_zero_point, W_scale, W_zero_point, Y_scale, Y_zero_point,
use_bias, use_fused, use_channelwise):
# ONEDNN only supports symmetric quantization of weight
if torch.backends.quantized.engine == 'onednn' and not all(zp == 0 for zp in W_zero_point):
Contributor

if we select scale/zero_point based on qengine we won't need this check

Collaborator Author

Do you mean to select proper weight scale/zp by if ... else .... in each unit test?

Contributor

Yes, this function is called from https://github.com/pytorch/pytorch/blob/master/test/quantization/core/test_quantized_module.py#L387; we can generate the scale/zp based on the engine. The current implementation will just skip the check, I think.
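
Extending the same idea to the channelwise case discussed here (a sketch only; output_channels and the random range are illustrative, and qengine_is_onednn() is the same assumed helper as above):

```python
import torch
from torch.testing._internal.common_quantized import qengine_is_onednn  # assumed helper

output_channels = 8
if qengine_is_onednn():
    # symmetric weights: every per-channel zero point must be 0
    W_zero_point = [0] * output_channels
else:
    W_zero_point = torch.randint(-5, 5, (output_channels,)).tolist()
```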

W = torch.rand(out_features, in_features).float()
W_scale, W_zp = _calculate_dynamic_qparams(W, torch.qint8)
# ONEDNN only supports symmetric quantization of weight
if torch.backends.quantized.engine == 'onednn' and W_zp != 0:
Contributor

same here

Collaborator Author

Here, the weight scale and zero point are calculated, not selected manually. Do you mean we need a new function to calculate the weight scale/zero point for symmetric quantization?

Contributor

Yes, I think so. We can add a qscheme argument to https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_quantized.py#L49 to support symmetric quantization, and set the qscheme to symmetric quantization when the qengine is mkldnn/onednn.
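
A hedged sketch of what such a symmetric variant of the test-utility qparam calculation might look like. This is not the actual common_quantized.py implementation; the function name and the qint8-only handling are assumptions.

```python
import torch

def _calculate_qparams_symmetric(x, dtype=torch.qint8):
    # Symmetric quantization: the zero point is fixed at 0 and the scale is
    # derived from the maximum absolute value of the tensor.
    assert dtype == torch.qint8, "symmetric weights are expected to be qint8"
    qmax = 127
    max_abs = x.abs().max().item()
    scale = max(max_abs / qmax, 1e-8)  # avoid a zero scale for an all-zero tensor
    return scale, 0

W = torch.rand(8, 16)
W_scale, W_zp = _calculate_qparams_symmetric(W, torch.qint8)
W_q = torch.quantize_per_tensor(W, W_scale, W_zp, torch.qint8)
```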

Comment on lines 3212 to 3214
# ONEDNN only supports symmetric quantization of weight
if torch.backends.quantized.engine == 'onednn':
    W_zps = np.zeros(output_channels).astype(np.int)
Contributor

nit: we can put L3211 in the else branch

…ackend only supports symmetric quantization of weight
@jerryzh168 (Contributor)

I did a rebase and it looks like there are still errors. I think the third-party import is probably still not done yet.

@Xia-Weiwen (Collaborator Author)

Hi @jerryzh168, thanks for the update. Could you please remind @frank-wei to import? Thanks.

@jerryzh168 (Contributor)

Hi @Xia-Weiwen, can you resolve the merge conflict? @frank-wei just finished the update of the ideep library, so we can import the PR now.

@Xia-Weiwen (Collaborator Author)

Hi @jerryzh168, the conflict is resolved. Please move on. Thanks.

@facebook-github-bot (Contributor)

@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jerryzh168 (Contributor)

Hi @Xia-Weiwen, I have imported the PR again, and it looks like there are some other errors:

stderr: caffe2/aten/src/ATen/native/quantized/cpu/qlinear.cpp:440:17: error: unused variable 'dim' [-Werror,-Wunused-variable]
  const int64_t dim = input.dim();
stderr: caffe2/aten/src/ATen/native/quantized/cpu/qconv.cpp:911:16: error: unused variable 'with_groups' [-Werror,-Wunused-variable]
    const bool with_groups = groups() > 1;

There are other lint warnings as well, and I'm not sure what the best way to communicate them is, but I feel it might be OK to fix them later.

@Xia-Weiwen (Collaborator Author)

Hi @jerryzh168, thanks for the update. Are they all unused-variable warnings? Do you think it's better to fix them now or later in another PR? In any case, could you please provide a log of these warnings so I can fix them?

@jerryzh168 (Contributor)

> Hi @jerryzh168, thanks for the update. Are they all unused-variable warnings? Do you think it's better to fix them now or later in another PR? In any case, could you please provide a log of these warnings so I can fix them?

Hi @Xia-Weiwen, unfortunately there is no easy way to export those warnings right now. Can you just fix the blocking ones for now, i.e. the ones I pasted in the previous comment? I think we can leave the rest there for now.

We're also moving to GitHub First soon, so hopefully these problems can be addressed during that move.

@facebook-github-bot (Contributor)

@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jerryzh168 (Contributor)

Hi @Xia-Weiwen, I can confirm there are no more internal errors now, but it looks like there are some new merge conflicts. Can you help resolve them? I think we should be able to land after that.

@facebook-github-bot (Contributor)

@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Mar 11, 2022
Summary:
This PR adds a new quantization backend, ONEDNN, with quantized conv and linear kernels in the same code path as the FBGEMM backend; see the PR description above for the full summary, design docs, file changes, and validation results.

Pull Request resolved: #69820

Reviewed By: HDCharles

Differential Revision: D33716039

Pulled By: jerryzh168

fbshipit-source-id: 6f7bb807e85798142dfcffccfca8b8bd652fb3dd
@github-actions (Contributor)

Hey @Xia-Weiwen.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

#include <ATen/Config.h>
#if AT_MKLDNN_ENABLED()
#include <ATen/Tensor.h>
#include <ATen/native/quantized/cpu/conv_packed_params.h>
Contributor

Hi @Xia-Weiwen, I landed the PR, but it looks like this line is not up to date; we should remove it. I'm reverting the change right now. Can you help recreate the PR after this is reverted?

Collaborator Author

Hi @jerryzh168, I created a new PR #74137. Please take a look. Thanks.

@facebook-github-bot (Contributor)

This pull request has been reverted by 5a89753. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).

facebook-github-bot pushed a commit that referenced this pull request Mar 15, 2022
Summary:
Resolve the conflicts in #69820
jerryzh168 Please review. Thanks.

Pull Request resolved: #74137

Reviewed By: samdow

Differential Revision: D34840477

Pulled By: jerryzh168

fbshipit-source-id: 8aa60981ff7be211a1609644f273b16d18efd425
pytorchmergebot pushed a commit that referenced this pull request Mar 15, 2022
Summary: Resolve the conflicts in #69820. (Cherry picked from commit de76bb8; otherwise identical to the commit above.)
@jerryzh168 jerryzh168 added release notes: quantization release notes category topic: new features topic category labels Mar 15, 2022


Labels

  • cla signed
  • intel priority (matters to Intel architecture from a performance perspective)
  • intel (This tag is for PRs from Intel open source)
  • release notes: quantization (release notes category)
  • Reverted
  • topic: new features (topic category)
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

Status: Done

Development


9 participants