Conversation


@dzdang dzdang commented Mar 7, 2022

Stack from ghstack (oldest at bottom):

Summary:
This PR removes the int_repr() calls for the activation and weight tensors.
Rather than using an int8 tensor, we use the qint8 tensor directly since, fundamentally,
the two tensors are equivalent except that the qint8 tensor carries its quantization
parameters (scale and zero point). This avoids a copy of the tensor data and
significantly improves efficiency.
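The copy being eliminated can be illustrated with a toy model (hypothetical names and a stand-in class, not the PyTorch API): a quantized tensor is raw int8 storage plus quantization parameters, `int_repr()` materializes a fresh int8 buffer, and passing the quantized tensor itself lets the kernel read the existing storage.

```python
import array

class ToyQuantizedTensor:
    """Stand-in for a qint8 tensor: int8 storage plus scale/zero_point."""
    def __init__(self, data, scale, zero_point):
        self.storage = array.array('b', data)  # signed 8-bit values
        self.scale = scale
        self.zero_point = zero_point

    def int_repr(self):
        # Returns a *new* buffer: this is the copy the PR eliminates.
        return array.array('b', self.storage)

def conv_kernel_input(q, use_int_repr):
    # A conv kernel only needs the raw int8 buffer; whether we copy it
    # first is the difference between the old and new code paths.
    return q.int_repr() if use_int_repr else q.storage

q = ToyQuantizedTensor([1, -2, 3], scale=0.1, zero_point=0)
copied = conv_kernel_input(q, use_int_repr=True)
direct = conv_kernel_input(q, use_int_repr=False)
assert copied is not q.storage  # int_repr() allocated a copy
assert direct is q.storage      # direct use shares the buffer
```

In the real kernel the underlying bytes are identical either way, which is why dropping the `int_repr()` step changes nothing but the allocation and copy cost.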

Test plan:
In the pytorch main directory, execute
```
python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn
```
for accuracy testing and
```
python test/test_quantization.py TestQuantizedConv.test_benchmark
```
for benchmark testing.

Previous int8 benchmark result:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                      quantized::conv2d        99.37%        2.408s        99.44%        2.410s     120.500ms       0.000us         0.00%       6.142ms     307.100us            20
                                  cudaDeviceSynchronize         0.48%      11.747ms         0.48%      11.747ms      11.747ms       0.000us         0.00%       0.000us       0.000us             1
                                          ProfilerStep*         0.07%       1.731ms        99.51%        2.412s     120.587ms       0.000us         0.00%       6.142ms     307.100us            20
                                            aten::empty         0.02%     501.000us         0.02%     501.000us       3.579us       0.000us         0.00%       0.000us       0.000us           140
                                       cudaLaunchKernel         0.02%     452.000us         0.02%     452.000us       7.533us       0.000us         0.00%       0.000us       0.000us            60
                                         aten::int_repr         0.01%     351.000us         0.04%     886.000us      22.150us       2.700ms        12.93%       2.700ms      67.500us            40
                          aten::_empty_affine_quantized         0.01%     172.000us         0.01%     172.000us       8.600us       0.000us         0.00%       0.000us       0.000us            20
                                            aten::fill_         0.01%     139.000us         0.01%     254.000us      12.700us       3.442ms        16.49%       3.442ms     172.100us            20
                                          aten::q_scale         0.00%      62.000us         0.00%      62.000us       1.550us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.00%      61.000us         0.00%     112.000us       5.600us       0.000us         0.00%       0.000us       0.000us            20
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.424s
Self CUDA time total: 20.877ms
```

Current int8 benchmark result:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                  cudaDeviceSynchronize        83.02%      15.241ms        83.02%      15.241ms      15.241ms       0.000us         0.00%       0.000us       0.000us             1
                                          ProfilerStep*         7.54%       1.384ms        16.48%       3.026ms     151.300us       0.000us         0.00%       3.460ms     173.000us            20
                                      quantized::conv2d         4.47%     821.000us         8.89%       1.632ms      81.600us       0.000us         0.00%       3.460ms     173.000us            20
                                            aten::empty         1.43%     262.000us         1.43%     262.000us       2.620us       0.000us         0.00%       0.000us       0.000us           100
                                       cudaLaunchKernel         1.05%     193.000us         1.05%     193.000us       9.650us       0.000us         0.00%       0.000us       0.000us            20
                                            aten::fill_         0.89%     164.000us         1.94%     357.000us      17.850us       3.460ms        19.64%       3.460ms     173.000us            20
                          aten::_empty_affine_quantized         0.86%     157.000us         0.86%     157.000us       7.850us       0.000us         0.00%       0.000us       0.000us            20
                                          aten::q_scale         0.32%      59.000us         0.32%      59.000us       1.475us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.29%      53.000us         0.50%      92.000us       4.600us       0.000us         0.00%       0.000us       0.000us            20
                                        cudaEventRecord         0.11%      20.000us         0.11%      20.000us       1.000us       0.000us         0.00%       0.000us       0.000us            20
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 18.116ms
Self CUDA time total: 17.612ms
```
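Reading the two profiles together: the previous run spent 2.408 s of self CPU time inside quantized::conv2d (with aten::int_repr appearing 40 times), while after this change aten::int_repr is gone from the table entirely. A quick back-of-the-envelope check of the reported totals (values transcribed from the tables above):

```python
# Speedup implied by the "Self CPU time total" lines of the two profiles.
prev_cpu_s = 2.424      # previous run: 2.424 s
curr_cpu_s = 0.018116   # current run: 18.116 ms
print(f"CPU self-time speedup: ~{prev_cpu_s / curr_cpu_s:.0f}x")  # ~134x
```

The CUDA self-time totals are of the same order in both runs (20.877 ms vs 17.612 ms), consistent with the savings coming from the eliminated host-side copies rather than from the kernels themselves.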

Differential Revision: D34824248


pytorch-bot bot commented Mar 7, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/e92cb2c830af21efcbc09f639f59ddaeff7ae48f/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-manywheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build ciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
macos-arm64-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-arm64-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
macos-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
windows-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla 🚫 skipped


facebook-github-bot commented Mar 7, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 08f0b51 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


dzdang added a commit that referenced this pull request Mar 7, 2022
… cudnn implementation

Summary:
This PR removes the int_repr() calls for the activation and weight tensors.
Rather than using int8 tensor, we use the qint8 tensor directly as, fundamentaly,
the two tensors are equivalent except qint8 tensor has a qconfig. This avoids
a copy of the qint8 tensor and significantly increases efficiency.

Test plan:
In pytorch main directory, execute
```
python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn
```
for accuracy testing and
```
python test/test_quantization.py TestQuantizedConv.test_benchmark
```
for benchmark testing.

Previous int8 benchmark:
int8 benchmark result:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                      quantized::conv2d        99.37%        2.408s        99.44%        2.410s     120.500ms       0.000us         0.00%       6.142ms     307.100us            20
                                  cudaDeviceSynchronize         0.48%      11.747ms         0.48%      11.747ms      11.747ms       0.000us         0.00%       0.000us       0.000us             1
                                          ProfilerStep*         0.07%       1.731ms        99.51%        2.412s     120.587ms       0.000us         0.00%       6.142ms     307.100us            20
                                            aten::empty         0.02%     501.000us         0.02%     501.000us       3.579us       0.000us         0.00%       0.000us       0.000us           140
                                       cudaLaunchKernel         0.02%     452.000us         0.02%     452.000us       7.533us       0.000us         0.00%       0.000us       0.000us            60
                                         aten::int_repr         0.01%     351.000us         0.04%     886.000us      22.150us       2.700ms        12.93%       2.700ms      67.500us            40
                          aten::_empty_affine_quantized         0.01%     172.000us         0.01%     172.000us       8.600us       0.000us         0.00%       0.000us       0.000us            20
                                            aten::fill_         0.01%     139.000us         0.01%     254.000us      12.700us       3.442ms        16.49%       3.442ms     172.100us            20
                                          aten::q_scale         0.00%      62.000us         0.00%      62.000us       1.550us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.00%      61.000us         0.00%     112.000us       5.600us       0.000us         0.00%       0.000us       0.000us            20
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.424s
Self CUDA time total: 20.877ms
```

Current int8 benchmark:
```
int8 benchmark result:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                  cudaDeviceSynchronize        83.02%      15.241ms        83.02%      15.241ms      15.241ms       0.000us         0.00%       0.000us       0.000us             1
                                          ProfilerStep*         7.54%       1.384ms        16.48%       3.026ms     151.300us       0.000us         0.00%       3.460ms     173.000us            20
                                      quantized::conv2d         4.47%     821.000us         8.89%       1.632ms      81.600us       0.000us         0.00%       3.460ms     173.000us            20
                                            aten::empty         1.43%     262.000us         1.43%     262.000us       2.620us       0.000us         0.00%       0.000us       0.000us           100
                                       cudaLaunchKernel         1.05%     193.000us         1.05%     193.000us       9.650us       0.000us         0.00%       0.000us       0.000us            20
                                            aten::fill_         0.89%     164.000us         1.94%     357.000us      17.850us       3.460ms        19.64%       3.460ms     173.000us            20
                          aten::_empty_affine_quantized         0.86%     157.000us         0.86%     157.000us       7.850us       0.000us         0.00%       0.000us       0.000us            20
                                          aten::q_scale         0.32%      59.000us         0.32%      59.000us       1.475us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.29%      53.000us         0.50%      92.000us       4.600us       0.000us         0.00%       0.000us       0.000us            20
                                        cudaEventRecord         0.11%      20.000us         0.11%      20.000us       1.000us       0.000us         0.00%       0.000us       0.000us            20
Self CPU time total: 18.116ms
Self CUDA time total: 17.612ms
```

ghstack-source-id: af81fc3
Pull Request resolved: #73849
dzdang added 4 commits March 7, 2022 13:14
@dzdang dzdang requested a review from jerryzh168 March 8, 2022 14:21
dzdang added 3 commits March 10, 2022 13:13
```cpp
}

cudnnDataType_t getCudnnDataType(const at::Tensor& tensor) {
  if (tensor.is_quantized()) {
```
Contributor:

should we add this to getCudnnDataTypeFromScalarType?

Contributor Author:

It seems we never call getCudnnDataTypeFromScalarType directly, since it's only called from getCudnnDataType, so I was thinking it'd be better to do it from the calling function. But maybe it's clearer to do it your way; I can make that change.
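The dispatch being discussed can be sketched as follows (a plain-Python mock with hypothetical names mirroring the C++ helpers; the real ones live in ATen's cuDNN glue and return `cudnnDataType_t` values). Putting the quantized mapping inside `getCudnnDataTypeFromScalarType`, as the reviewer suggests, keeps `getCudnnDataType` a thin wrapper:

```python
# Hypothetical sketch of the helper split under discussion; string constants
# stand in for the cudnnDataType_t enum values.
CUDNN_DATA_FLOAT = "CUDNN_DATA_FLOAT"
CUDNN_DATA_INT8 = "CUDNN_DATA_INT8"

def get_cudnn_data_type_from_scalar_type(scalar_type):
    # Reviewer's suggestion: handle quantized scalar types here too,
    # so every caller gets the mapping without a special case.
    mapping = {
        "float": CUDNN_DATA_FLOAT,
        "qint8": CUDNN_DATA_INT8,  # quantized int8 maps to plain int8 for cuDNN
    }
    return mapping[scalar_type]

def get_cudnn_data_type(tensor):
    # Thin wrapper: just forward the tensor's scalar type.
    return get_cudnn_data_type_from_scalar_type(tensor["scalar_type"])

assert get_cudnn_data_type({"scalar_type": "qint8"}) == CUDNN_DATA_INT8
```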

```cpp
c10::optional<at::Tensor> after_add;
c10::optional<at::Tensor> broadcasted_bias;
c10::optional<at::Tensor> after_relu;
<<<<<<< HEAD
```
Contributor:

Looks like there are some unresolved merge conflicts here.

```cpp
uids.reserve(10);
data_ptrs = {reinterpret_cast<int8_t*>(input.data_ptr()), conv_output.data_ptr(),
             reinterpret_cast<int8_t*>(weight.data_ptr()),
             reinterpret_cast<int8_t*>(orig_weight_.data_ptr()),
```
Contributor (@jerryzh168, Mar 11, 2022):

This looks unrelated to this PR; should it happen in a different PR?

Contributor Author:

I think something got messed up when I rebased yesterday. This should've already been done in a previous PR

```cpp
// TODO: combine empty & fill_ using full_like or full
at::Tensor requantize_multiplier_tensor = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat), at::MemoryFormat::ChannelsLast);
auto act_scale = input.q_scale();
auto weight_scale = orig_weight.q_scale();
```
Contributor:

orig_weight --> orig_weight_

Contributor Author:

Hmm, looks like my rebase yesterday wasn't done properly. I'll fix this.
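For context on the requantize multiplier in the snippet above: the int32 conv accumulator carries an effective scale of `act_scale * weight_scale`, so multiplying by `act_scale * weight_scale / out_scale` re-expresses it in the output's quantization grid. A plain-Python sketch of this standard per-tensor formula (not the cuDNN kernel itself):

```python
def requantize(acc_int32, act_scale, weight_scale, out_scale, out_zero_point):
    # The accumulator represents acc_int32 * act_scale * weight_scale in real
    # units; rescale that into the output grid, add the zero point, and clamp.
    multiplier = act_scale * weight_scale / out_scale
    q = round(acc_int32 * multiplier) + out_zero_point
    return max(-128, min(127, q))  # clamp to the signed int8 range

# Accumulator 1000 with act_scale=0.02, weight_scale=0.01, out_scale=0.05:
# real value 0.2, which is 4 steps of size 0.05.
assert requantize(1000, 0.02, 0.01, 0.05, 0) == 4
```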

dzdang added 2 commits March 11, 2022 10:07
                                          aten::q_scale         0.00%      62.000us         0.00%      62.000us       1.550us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.00%      61.000us         0.00%     112.000us       5.600us       0.000us         0.00%       0.000us       0.000us            20
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.424s
Self CUDA time total: 20.877ms
```

Current int8 benchmark result:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                  cudaDeviceSynchronize        83.02%      15.241ms        83.02%      15.241ms      15.241ms       0.000us         0.00%       0.000us       0.000us             1
                                          ProfilerStep*         7.54%       1.384ms        16.48%       3.026ms     151.300us       0.000us         0.00%       3.460ms     173.000us            20
                                      quantized::conv2d         4.47%     821.000us         8.89%       1.632ms      81.600us       0.000us         0.00%       3.460ms     173.000us            20
                                            aten::empty         1.43%     262.000us         1.43%     262.000us       2.620us       0.000us         0.00%       0.000us       0.000us           100
                                       cudaLaunchKernel         1.05%     193.000us         1.05%     193.000us       9.650us       0.000us         0.00%       0.000us       0.000us            20
                                            aten::fill_         0.89%     164.000us         1.94%     357.000us      17.850us       3.460ms        19.64%       3.460ms     173.000us            20
                          aten::_empty_affine_quantized         0.86%     157.000us         0.86%     157.000us       7.850us       0.000us         0.00%       0.000us       0.000us            20
                                          aten::q_scale         0.32%      59.000us         0.32%      59.000us       1.475us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.29%      53.000us         0.50%      92.000us       4.600us       0.000us         0.00%       0.000us       0.000us            20
                                        cudaEventRecord         0.11%      20.000us         0.11%      20.000us       1.000us       0.000us         0.00%       0.000us       0.000us            20
Self CPU time total: 18.116ms
Self CUDA time total: 17.612ms
```

[ghstack-poisoned]
…ized conv2d cudnn implementation"

@dzdang
Contributor Author

dzdang commented Mar 11, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dzdang dzdang requested a review from jerryzh168 March 11, 2022 18:39
dzdang added 4 commits March 17, 2022 08:29
…ized conv2d cudnn implementation"

Differential Revision: [D34824248](https://our.internmc.facebook.com/intern/diff/D34824248)
…ized conv2d cudnn implementation"

…ized conv2d cudnn implementation"

@dzdang
Copy link
Contributor Author

dzdang commented Mar 17, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Mar 18, 2022
… cudnn implementation (#73849)

Summary:
Pull Request resolved: #73849

This PR removes the int_repr() calls for the activation and weight tensors.
Rather than using an int8 tensor, we use the qint8 tensor directly since, fundamentally,
the two tensors are equivalent except that the qint8 tensor carries a qconfig. This avoids
a copy of the qint8 tensor and significantly improves efficiency.


Reviewed By: jerryzh168

Differential Revision: D34824248

Pulled By: dzdang

fbshipit-source-id: f1a558b50d1c9f8f30e1714d3a4667d929fc72ba
@facebook-github-bot facebook-github-bot deleted the gh/dzdang/47/head branch March 21, 2022 14:17
shahofblah pushed a commit that referenced this pull request Mar 25, 2022
… cudnn implementation (#73849)

Summary:
Pull Request resolved: #73849


Reviewed By: jerryzh168

Differential Revision: D34824248

Pulled By: dzdang

fbshipit-source-id: f1a558b50d1c9f8f30e1714d3a4667d929fc72ba
(cherry picked from commit e52ce62)