Conversation


@dzdang dzdang commented Mar 7, 2022

Stack from ghstack (oldest at bottom):

Summary:
This PR removes the int_repr() calls for the activation and weight tensors.
Rather than using an int8 tensor, we use the qint8 tensor directly since, fundamentally,
the two tensors are equivalent except that the qint8 tensor carries its quantization
parameters (scale and zero point). This avoids a copy of the tensor data and
significantly improves efficiency.
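The copy being eliminated can be illustrated with a toy model (hypothetical names and a stand-in class, not the PyTorch API): a quantized tensor is raw int8 storage plus quantization parameters, `int_repr()` materializes a fresh int8 buffer, and passing the quantized tensor itself lets the kernel read the existing storage.

```python
import array

class ToyQuantizedTensor:
    """Stand-in for a qint8 tensor: int8 storage plus scale/zero_point."""
    def __init__(self, data, scale, zero_point):
        self.storage = array.array('b', data)  # signed 8-bit values
        self.scale = scale
        self.zero_point = zero_point

    def int_repr(self):
        # Returns a *new* buffer: this is the copy the PR eliminates.
        return array.array('b', self.storage)

def conv_kernel_input(q, use_int_repr):
    # A conv kernel only needs the raw int8 buffer; whether we copy it
    # first is the difference between the old and new code paths.
    return q.int_repr() if use_int_repr else q.storage

q = ToyQuantizedTensor([1, -2, 3], scale=0.1, zero_point=0)
copied = conv_kernel_input(q, use_int_repr=True)
direct = conv_kernel_input(q, use_int_repr=False)
assert copied is not q.storage  # int_repr() allocated a copy
assert direct is q.storage      # direct use shares the buffer
```

In the real kernel the underlying bytes are identical either way, which is why dropping the `int_repr()` step changes nothing but the allocation and copy cost.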

Test plan:
In the pytorch main directory, execute
```
python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn
```
for accuracy testing and
```
python test/test_quantization.py TestQuantizedConv.test_benchmark
```
for benchmark testing.

Previous int8 benchmark result:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                      quantized::conv2d        99.37%        2.408s        99.44%        2.410s     120.500ms       0.000us         0.00%       6.142ms     307.100us            20
                                  cudaDeviceSynchronize         0.48%      11.747ms         0.48%      11.747ms      11.747ms       0.000us         0.00%       0.000us       0.000us             1
                                          ProfilerStep*         0.07%       1.731ms        99.51%        2.412s     120.587ms       0.000us         0.00%       6.142ms     307.100us            20
                                            aten::empty         0.02%     501.000us         0.02%     501.000us       3.579us       0.000us         0.00%       0.000us       0.000us           140
                                       cudaLaunchKernel         0.02%     452.000us         0.02%     452.000us       7.533us       0.000us         0.00%       0.000us       0.000us            60
                                         aten::int_repr         0.01%     351.000us         0.04%     886.000us      22.150us       2.700ms        12.93%       2.700ms      67.500us            40
                          aten::_empty_affine_quantized         0.01%     172.000us         0.01%     172.000us       8.600us       0.000us         0.00%       0.000us       0.000us            20
                                            aten::fill_         0.01%     139.000us         0.01%     254.000us      12.700us       3.442ms        16.49%       3.442ms     172.100us            20
                                          aten::q_scale         0.00%      62.000us         0.00%      62.000us       1.550us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.00%      61.000us         0.00%     112.000us       5.600us       0.000us         0.00%       0.000us       0.000us            20
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.424s
Self CUDA time total: 20.877ms
```

Current int8 benchmark result:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                  cudaDeviceSynchronize        83.02%      15.241ms        83.02%      15.241ms      15.241ms       0.000us         0.00%       0.000us       0.000us             1
                                          ProfilerStep*         7.54%       1.384ms        16.48%       3.026ms     151.300us       0.000us         0.00%       3.460ms     173.000us            20
                                      quantized::conv2d         4.47%     821.000us         8.89%       1.632ms      81.600us       0.000us         0.00%       3.460ms     173.000us            20
                                            aten::empty         1.43%     262.000us         1.43%     262.000us       2.620us       0.000us         0.00%       0.000us       0.000us           100
                                       cudaLaunchKernel         1.05%     193.000us         1.05%     193.000us       9.650us       0.000us         0.00%       0.000us       0.000us            20
                                            aten::fill_         0.89%     164.000us         1.94%     357.000us      17.850us       3.460ms        19.64%       3.460ms     173.000us            20
                          aten::_empty_affine_quantized         0.86%     157.000us         0.86%     157.000us       7.850us       0.000us         0.00%       0.000us       0.000us            20
                                          aten::q_scale         0.32%      59.000us         0.32%      59.000us       1.475us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.29%      53.000us         0.50%      92.000us       4.600us       0.000us         0.00%       0.000us       0.000us            20
                                        cudaEventRecord         0.11%      20.000us         0.11%      20.000us       1.000us       0.000us         0.00%       0.000us       0.000us            20
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 18.116ms
Self CUDA time total: 17.612ms
```
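Reading the two profiles together: the previous run spent 2.408 s of self CPU time inside quantized::conv2d (with aten::int_repr appearing 40 times), while after this change aten::int_repr is gone from the table entirely. A quick back-of-the-envelope check of the reported totals (values transcribed from the tables above):

```python
# Speedup implied by the "Self CPU time total" lines of the two profiles.
prev_cpu_s = 2.424      # previous run: 2.424 s
curr_cpu_s = 0.018116   # current run: 18.116 ms
print(f"CPU self-time speedup: ~{prev_cpu_s / curr_cpu_s:.0f}x")  # ~134x
```

The CUDA self-time totals are of the same order in both runs (20.877 ms vs 17.612 ms), consistent with the savings coming from the eliminated host-side copies rather than from the kernels themselves.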

Differential Revision: D34824248


pytorch-bot bot commented Mar 7, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/e92cb2c830af21efcbc09f639f59ddaeff7ae48f/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-manywheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build ciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
macos-arm64-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-arm64-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
macos-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
windows-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla 🚫 skipped


facebook-github-bot commented Mar 7, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 08f0b51 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


dzdang added a commit that referenced this pull request Mar 7, 2022
… cudnn implementation

Summary:
This PR removes the int_repr() calls for the activation and weight tensors.
Rather than using int8 tensor, we use the qint8 tensor directly as, fundamentaly,
the two tensors are equivalent except qint8 tensor has a qconfig. This avoids
a copy of the qint8 tensor and significantly increases efficiency.

Test plan:
In pytorch main directory, execute
```
python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn
```
for accuracy testing and
```
python test/test_quantization.py TestQuantizedConv.test_benchmark
```
for benchmark testing.

Previous int8 benchmark:
int8 benchmark result:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                      quantized::conv2d        99.37%        2.408s        99.44%        2.410s     120.500ms       0.000us         0.00%       6.142ms     307.100us            20
                                  cudaDeviceSynchronize         0.48%      11.747ms         0.48%      11.747ms      11.747ms       0.000us         0.00%       0.000us       0.000us             1
                                          ProfilerStep*         0.07%       1.731ms        99.51%        2.412s     120.587ms       0.000us         0.00%       6.142ms     307.100us            20
                                            aten::empty         0.02%     501.000us         0.02%     501.000us       3.579us       0.000us         0.00%       0.000us       0.000us           140
                                       cudaLaunchKernel         0.02%     452.000us         0.02%     452.000us       7.533us       0.000us         0.00%       0.000us       0.000us            60
                                         aten::int_repr         0.01%     351.000us         0.04%     886.000us      22.150us       2.700ms        12.93%       2.700ms      67.500us            40
                          aten::_empty_affine_quantized         0.01%     172.000us         0.01%     172.000us       8.600us       0.000us         0.00%       0.000us       0.000us            20
                                            aten::fill_         0.01%     139.000us         0.01%     254.000us      12.700us       3.442ms        16.49%       3.442ms     172.100us            20
                                          aten::q_scale         0.00%      62.000us         0.00%      62.000us       1.550us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.00%      61.000us         0.00%     112.000us       5.600us       0.000us         0.00%       0.000us       0.000us            20
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.424s
Self CUDA time total: 20.877ms
```

Current int8 benchmark:
```
int8 benchmark result:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                  cudaDeviceSynchronize        83.02%      15.241ms        83.02%      15.241ms      15.241ms       0.000us         0.00%       0.000us       0.000us             1
                                          ProfilerStep*         7.54%       1.384ms        16.48%       3.026ms     151.300us       0.000us         0.00%       3.460ms     173.000us            20
                                      quantized::conv2d         4.47%     821.000us         8.89%       1.632ms      81.600us       0.000us         0.00%       3.460ms     173.000us            20
                                            aten::empty         1.43%     262.000us         1.43%     262.000us       2.620us       0.000us         0.00%       0.000us       0.000us           100
                                       cudaLaunchKernel         1.05%     193.000us         1.05%     193.000us       9.650us       0.000us         0.00%       0.000us       0.000us            20
                                            aten::fill_         0.89%     164.000us         1.94%     357.000us      17.850us       3.460ms        19.64%       3.460ms     173.000us            20
                          aten::_empty_affine_quantized         0.86%     157.000us         0.86%     157.000us       7.850us       0.000us         0.00%       0.000us       0.000us            20
                                          aten::q_scale         0.32%      59.000us         0.32%      59.000us       1.475us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.29%      53.000us         0.50%      92.000us       4.600us       0.000us         0.00%       0.000us       0.000us            20
                                        cudaEventRecord         0.11%      20.000us         0.11%      20.000us       1.000us       0.000us         0.00%       0.000us       0.000us            20
Self CPU time total: 18.116ms
Self CUDA time total: 17.612ms
```

ghstack-source-id: af81fc3
Pull Request resolved: #73849
dzdang added 4 commits March 7, 2022 13:14
@dzdang dzdang requested a review from jerryzh168 March 8, 2022 14:21
dzdang added 3 commits March 10, 2022 13:13
```cpp
}

cudnnDataType_t getCudnnDataType(const at::Tensor& tensor) {
  if (tensor.is_quantized()) {
```
Contributor:

should we add this to getCudnnDataTypeFromScalarType?

Contributor Author:

It seems we never call getCudnnDataTypeFromScalarType directly, since it's only called from getCudnnDataType, so I was thinking it'd be better to do it from the calling function. But maybe it's clearer to do it your way; I can make that change.
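The dispatch being discussed can be sketched as follows (a plain-Python mock with hypothetical names mirroring the C++ helpers; the real ones live in ATen's cuDNN glue and return `cudnnDataType_t` values). Putting the quantized mapping inside `getCudnnDataTypeFromScalarType`, as the reviewer suggests, keeps `getCudnnDataType` a thin wrapper:

```python
# Hypothetical sketch of the helper split under discussion; string constants
# stand in for the cudnnDataType_t enum values.
CUDNN_DATA_FLOAT = "CUDNN_DATA_FLOAT"
CUDNN_DATA_INT8 = "CUDNN_DATA_INT8"

def get_cudnn_data_type_from_scalar_type(scalar_type):
    # Reviewer's suggestion: handle quantized scalar types here too,
    # so every caller gets the mapping without a special case.
    mapping = {
        "float": CUDNN_DATA_FLOAT,
        "qint8": CUDNN_DATA_INT8,  # quantized int8 maps to plain int8 for cuDNN
    }
    return mapping[scalar_type]

def get_cudnn_data_type(tensor):
    # Thin wrapper: just forward the tensor's scalar type.
    return get_cudnn_data_type_from_scalar_type(tensor["scalar_type"])

assert get_cudnn_data_type({"scalar_type": "qint8"}) == CUDNN_DATA_INT8
```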

```cpp
c10::optional<at::Tensor> after_add;
c10::optional<at::Tensor> broadcasted_bias;
c10::optional<at::Tensor> after_relu;
<<<<<<< HEAD
```
Contributor:

Looks like there are some unresolved merge conflicts here.

```cpp
uids.reserve(10);
data_ptrs = {reinterpret_cast<int8_t*>(input.data_ptr()), conv_output.data_ptr(),
             reinterpret_cast<int8_t*>(weight.data_ptr()),
             reinterpret_cast<int8_t*>(orig_weight_.data_ptr()),
```
Contributor (@jerryzh168, Mar 11, 2022):

This looks unrelated to this PR; should it happen in a different PR?

Contributor Author:

I think something got messed up when I rebased yesterday. This should've already been done in a previous PR

```cpp
// TODO: combine empty & fill_ using full_like or full
at::Tensor requantize_multiplier_tensor = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat), at::MemoryFormat::ChannelsLast);
auto act_scale = input.q_scale();
auto weight_scale = orig_weight.q_scale();
```
Contributor:

orig_weight --> orig_weight_

Contributor Author:

Hmm, looks like my rebase yesterday wasn't done properly. I'll fix this.
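For context on the requantize multiplier in the snippet above: the int32 conv accumulator carries an effective scale of `act_scale * weight_scale`, so multiplying by `act_scale * weight_scale / out_scale` re-expresses it in the output's quantization grid. A plain-Python sketch of this standard per-tensor formula (not the cuDNN kernel itself):

```python
def requantize(acc_int32, act_scale, weight_scale, out_scale, out_zero_point):
    # The accumulator represents acc_int32 * act_scale * weight_scale in real
    # units; rescale that into the output grid, add the zero point, and clamp.
    multiplier = act_scale * weight_scale / out_scale
    q = round(acc_int32 * multiplier) + out_zero_point
    return max(-128, min(127, q))  # clamp to the signed int8 range

# Accumulator 1000 with act_scale=0.02, weight_scale=0.01, out_scale=0.05:
# real value 0.2, which is 4 steps of size 0.05.
assert requantize(1000, 0.02, 0.01, 0.05, 0) == 4
```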

dzdang added 2 commits March 11, 2022 10:07
                                          aten::q_scale         0.00%      62.000us         0.00%      62.000us       1.550us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.00%      61.000us         0.00%     112.000us       5.600us       0.000us         0.00%       0.000us       0.000us            20
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.424s
Self CUDA time total: 20.877ms
```

Current int8 benchmark result:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                  cudaDeviceSynchronize        83.02%      15.241ms        83.02%      15.241ms      15.241ms       0.000us         0.00%       0.000us       0.000us             1
                                          ProfilerStep*         7.54%       1.384ms        16.48%       3.026ms     151.300us       0.000us         0.00%       3.460ms     173.000us            20
                                      quantized::conv2d         4.47%     821.000us         8.89%       1.632ms      81.600us       0.000us         0.00%       3.460ms     173.000us            20
                                            aten::empty         1.43%     262.000us         1.43%     262.000us       2.620us       0.000us         0.00%       0.000us       0.000us           100
                                       cudaLaunchKernel         1.05%     193.000us         1.05%     193.000us       9.650us       0.000us         0.00%       0.000us       0.000us            20
                                            aten::fill_         0.89%     164.000us         1.94%     357.000us      17.850us       3.460ms        19.64%       3.460ms     173.000us            20
                          aten::_empty_affine_quantized         0.86%     157.000us         0.86%     157.000us       7.850us       0.000us         0.00%       0.000us       0.000us            20
                                          aten::q_scale         0.32%      59.000us         0.32%      59.000us       1.475us       0.000us         0.00%       0.000us       0.000us            40
                                            aten::zeros         0.29%      53.000us         0.50%      92.000us       4.600us       0.000us         0.00%       0.000us       0.000us            20
                                        cudaEventRecord         0.11%      20.000us         0.11%      20.000us       1.000us       0.000us         0.00%       0.000us       0.000us            20
Self CPU time total: 18.116ms
Self CUDA time total: 17.612ms
```

[ghstack-poisoned]
…ized conv2d cudnn implementation"

@dzdang
Contributor Author

dzdang commented Mar 11, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dzdang dzdang requested a review from jerryzh168 March 11, 2022 18:39
dzdang added 4 commits March 17, 2022 08:29
…ized conv2d cudnn implementation"

Differential Revision: [D34824248](https://our.internmc.facebook.com/intern/diff/D34824248)
…ized conv2d cudnn implementation"

…ized conv2d cudnn implementation"

@dzdang
Copy link
Contributor Author

dzdang commented Mar 17, 2022

@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Mar 18, 2022
… cudnn implementation (#73849)

Summary:
Pull Request resolved: #73849

This PR removes the int_repr() calls for the activation and weight tensors.
Rather than using an int8 tensor, we use the qint8 tensor directly since, fundamentally,
the two tensors are equivalent except that the qint8 tensor carries a qconfig. This avoids
a copy of the qint8 tensor and significantly improves efficiency.


Reviewed By: jerryzh168

Differential Revision: D34824248

Pulled By: dzdang

fbshipit-source-id: f1a558b50d1c9f8f30e1714d3a4667d929fc72ba
@facebook-github-bot facebook-github-bot deleted the gh/dzdang/47/head branch March 21, 2022 14:17
shahofblah pushed a commit that referenced this pull request Mar 25, 2022
… cudnn implementation (#73849)

Summary:
Pull Request resolved: #73849


Reviewed By: jerryzh168

Differential Revision: D34824248

Pulled By: dzdang

fbshipit-source-id: f1a558b50d1c9f8f30e1714d3a4667d929fc72ba
(cherry picked from commit e52ce62)