[quant][core][performance] Removed int_repr calls in quantized conv2d cudnn implementation #73849
Conversation
… cudnn implementation
Summary:
This PR removes the int_repr() calls for the activation and weight tensors.
Rather than creating an int8 copy, we use the qint8 tensor directly: the two
tensors hold the same underlying int8 data, and the qint8 tensor simply carries
the quantization parameters on top. This avoids a copy of each qint8 tensor and
significantly improves efficiency.
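To make the change concrete, here is a minimal before/after sketch, assuming `act` and `weight` are qint8 tensors. It is illustrative only: `run_cudnn_conv` is a hypothetical placeholder for the cuDNN dispatch, not a function from this PR.
```
#include <ATen/ATen.h>

// Hypothetical stand-in for the cuDNN convolution dispatch; illustration only.
void run_cudnn_conv(const int8_t* act_data, const int8_t* weight_data);

void sketch(const at::Tensor& act, const at::Tensor& weight) {
  // Before: int_repr() materializes fresh int8 tensors, copying the data.
  at::Tensor act_int8 = act.int_repr();
  at::Tensor weight_int8 = weight.int_repr();
  run_cudnn_conv(act_int8.data_ptr<int8_t>(), weight_int8.data_ptr<int8_t>());

  // After: a qint8 tensor already stores int8 values, so its storage can be
  // handed to cuDNN directly and the extra copies disappear.
  run_cudnn_conv(reinterpret_cast<int8_t*>(act.data_ptr()),
                 reinterpret_cast<int8_t*>(weight.data_ptr()));
}
```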
Test plan:
In the PyTorch root directory, run
```
python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn
```
for accuracy testing and
```
python test/test_quantization.py TestQuantizedConv.test_benchmark
```
for benchmark testing.
Previous int8 benchmark result:
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
quantized::conv2d 99.37% 2.408s 99.44% 2.410s 120.500ms 0.000us 0.00% 6.142ms 307.100us 20
cudaDeviceSynchronize 0.48% 11.747ms 0.48% 11.747ms 11.747ms 0.000us 0.00% 0.000us 0.000us 1
ProfilerStep* 0.07% 1.731ms 99.51% 2.412s 120.587ms 0.000us 0.00% 6.142ms 307.100us 20
aten::empty 0.02% 501.000us 0.02% 501.000us 3.579us 0.000us 0.00% 0.000us 0.000us 140
cudaLaunchKernel 0.02% 452.000us 0.02% 452.000us 7.533us 0.000us 0.00% 0.000us 0.000us 60
aten::int_repr 0.01% 351.000us 0.04% 886.000us 22.150us 2.700ms 12.93% 2.700ms 67.500us 40
aten::_empty_affine_quantized 0.01% 172.000us 0.01% 172.000us 8.600us 0.000us 0.00% 0.000us 0.000us 20
aten::fill_ 0.01% 139.000us 0.01% 254.000us 12.700us 3.442ms 16.49% 3.442ms 172.100us 20
aten::q_scale 0.00% 62.000us 0.00% 62.000us 1.550us 0.000us 0.00% 0.000us 0.000us 40
aten::zeros 0.00% 61.000us 0.00% 112.000us 5.600us 0.000us 0.00% 0.000us 0.000us 20
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 2.424s
Self CUDA time total: 20.877ms
```
Current int8 benchmark result:
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
cudaDeviceSynchronize 83.02% 15.241ms 83.02% 15.241ms 15.241ms 0.000us 0.00% 0.000us 0.000us 1
ProfilerStep* 7.54% 1.384ms 16.48% 3.026ms 151.300us 0.000us 0.00% 3.460ms 173.000us 20
quantized::conv2d 4.47% 821.000us 8.89% 1.632ms 81.600us 0.000us 0.00% 3.460ms 173.000us 20
aten::empty 1.43% 262.000us 1.43% 262.000us 2.620us 0.000us 0.00% 0.000us 0.000us 100
cudaLaunchKernel 1.05% 193.000us 1.05% 193.000us 9.650us 0.000us 0.00% 0.000us 0.000us 20
aten::fill_ 0.89% 164.000us 1.94% 357.000us 17.850us 3.460ms 19.64% 3.460ms 173.000us 20
aten::_empty_affine_quantized 0.86% 157.000us 0.86% 157.000us 7.850us 0.000us 0.00% 0.000us 0.000us 20
aten::q_scale 0.32% 59.000us 0.32% 59.000us 1.475us 0.000us 0.00% 0.000us 0.000us 40
aten::zeros 0.29% 53.000us 0.50% 92.000us 4.600us 0.000us 0.00% 0.000us 0.000us 20
cudaEventRecord 0.11% 20.000us 0.11% 20.000us 1.000us 0.000us 0.00% 0.000us 0.000us 20
Self CPU time total: 18.116ms
Self CUDA time total: 17.612ms
```
aten/src/ATen/cudnn/Types.cpp
Outdated
```
cudnnDataType_t getCudnnDataType(const at::Tensor& tensor) {
  if (tensor.is_quantized()) {
```
should we add this to getCudnnDataTypeFromScalarType?
It seems we never call getCudnnDataTypeFromScalarType directly, since it's only called from getCudnnDataType, so I was thinking it'd be better to do it from the calling function, but maybe it's clearer to do it your way. I can make that change.
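For context, a hedged sketch of what the quantized branch under discussion could look like. The function name is hypothetical and the actual mapping in aten/src/ATen/cudnn/Types.cpp may cover more cases.
```
#include <ATen/ATen.h>
#include <cudnn.h>

// Sketch only: map a (possibly quantized) tensor's dtype to a cuDNN data type.
cudnnDataType_t getCudnnDataTypeSketch(const at::Tensor& tensor) {
  if (tensor.is_quantized()) {
    // qint8 stores plain int8 values, so cuDNN can treat it as INT8.
    if (tensor.scalar_type() == at::kQInt8) {
      return CUDNN_DATA_INT8;
    }
    TORCH_CHECK(false, "unsupported quantized dtype for cuDNN");
  }
  // Non-quantized tensors fall through to the usual ScalarType mapping.
  switch (tensor.scalar_type()) {
    case at::kFloat:  return CUDNN_DATA_FLOAT;
    case at::kDouble: return CUDNN_DATA_DOUBLE;
    case at::kHalf:   return CUDNN_DATA_HALF;
    default: TORCH_CHECK(false, "unsupported dtype for cuDNN");
  }
}
```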
```
c10::optional<at::Tensor> after_add;
c10::optional<at::Tensor> broadcasted_bias;
c10::optional<at::Tensor> after_relu;
<<<<<<< HEAD
```
Looks like there are some unresolved merge conflicts.
```
uids.reserve(10);
data_ptrs = {reinterpret_cast<int8_t*>(input.data_ptr()), conv_output.data_ptr(),
             reinterpret_cast<int8_t*>(weight.data_ptr()),
             reinterpret_cast<int8_t*>(orig_weight_.data_ptr()),
```
This looks unrelated to this PR; should this happen in a different PR?
I think something got messed up when I rebased yesterday. This should've already been done in a previous PR.
```
// TODO: combine empty & fill_ using full_like or full
at::Tensor requantize_multiplier_tensor = at::empty(quantized_output.sizes(), at::device(at::kCUDA).dtype(at::kFloat), at::MemoryFormat::ChannelsLast);
auto act_scale = input.q_scale();
auto weight_scale = orig_weight.q_scale();
```
orig_weight --> orig_weight_
Hmm, looks like my rebase yesterday wasn't done properly. I'll fix this.
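On the TODO in the quoted snippet, a minimal sketch of the suggested simplification. Names are illustrative; the ChannelsLast memory format used in the real code is omitted here and could be requested via at::full_like's memory_format argument.
```
#include <ATen/ATen.h>

// Sketch of the TODO: fold the empty() + fill_() pair into a single full() call.
at::Tensor requantize_multiplier_sketch(at::IntArrayRef sizes, double multiplier) {
  // Current pattern: allocate, then launch a separate fill_ kernel.
  at::Tensor a = at::empty(sizes, at::device(at::kCUDA).dtype(at::kFloat));
  a.fill_(multiplier);

  // Suggested pattern: one factory call that allocates and fills.
  at::Tensor b = at::full(sizes, multiplier, at::device(at::kCUDA).dtype(at::kFloat));
  return b;
}
```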
…ized conv2d cudnn implementation"
Summary:
This PR removes the int_repr() calls for the activation and weight tensors.
Rather than using int8 tensor, we use the qint8 tensor directly as, fundamentaly,
the two tensors are equivalent except qint8 tensor has a qconfig. This avoids
a copy of the qint8 tensor and significantly increases efficiency.
Test plan:
In pytorch main directory, execute
```
python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn
```
for accuracy testing and
```
python test/test_quantization.py TestQuantizedConv.test_benchmark
```
for benchmark testing.
Previous int8 benchmark:
int8 benchmark result:
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
quantized::conv2d 99.37% 2.408s 99.44% 2.410s 120.500ms 0.000us 0.00% 6.142ms 307.100us 20
cudaDeviceSynchronize 0.48% 11.747ms 0.48% 11.747ms 11.747ms 0.000us 0.00% 0.000us 0.000us 1
ProfilerStep* 0.07% 1.731ms 99.51% 2.412s 120.587ms 0.000us 0.00% 6.142ms 307.100us 20
aten::empty 0.02% 501.000us 0.02% 501.000us 3.579us 0.000us 0.00% 0.000us 0.000us 140
cudaLaunchKernel 0.02% 452.000us 0.02% 452.000us 7.533us 0.000us 0.00% 0.000us 0.000us 60
aten::int_repr 0.01% 351.000us 0.04% 886.000us 22.150us 2.700ms 12.93% 2.700ms 67.500us 40
aten::_empty_affine_quantized 0.01% 172.000us 0.01% 172.000us 8.600us 0.000us 0.00% 0.000us 0.000us 20
aten::fill_ 0.01% 139.000us 0.01% 254.000us 12.700us 3.442ms 16.49% 3.442ms 172.100us 20
aten::q_scale 0.00% 62.000us 0.00% 62.000us 1.550us 0.000us 0.00% 0.000us 0.000us 40
aten::zeros 0.00% 61.000us 0.00% 112.000us 5.600us 0.000us 0.00% 0.000us 0.000us 20
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 2.424s
Self CUDA time total: 20.877ms
```
Current int8 benchmark:
```
int8 benchmark result:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
cudaDeviceSynchronize 83.02% 15.241ms 83.02% 15.241ms 15.241ms 0.000us 0.00% 0.000us 0.000us 1
ProfilerStep* 7.54% 1.384ms 16.48% 3.026ms 151.300us 0.000us 0.00% 3.460ms 173.000us 20
quantized::conv2d 4.47% 821.000us 8.89% 1.632ms 81.600us 0.000us 0.00% 3.460ms 173.000us 20
aten::empty 1.43% 262.000us 1.43% 262.000us 2.620us 0.000us 0.00% 0.000us 0.000us 100
cudaLaunchKernel 1.05% 193.000us 1.05% 193.000us 9.650us 0.000us 0.00% 0.000us 0.000us 20
aten::fill_ 0.89% 164.000us 1.94% 357.000us 17.850us 3.460ms 19.64% 3.460ms 173.000us 20
aten::_empty_affine_quantized 0.86% 157.000us 0.86% 157.000us 7.850us 0.000us 0.00% 0.000us 0.000us 20
aten::q_scale 0.32% 59.000us 0.32% 59.000us 1.475us 0.000us 0.00% 0.000us 0.000us 40
aten::zeros 0.29% 53.000us 0.50% 92.000us 4.600us 0.000us 0.00% 0.000us 0.000us 20
cudaEventRecord 0.11% 20.000us 0.11% 20.000us 1.000us 0.000us 0.00% 0.000us 0.000us 20
Self CPU time total: 18.116ms
Self CUDA time total: 17.612ms
```
[ghstack-poisoned]
@dzdang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
… cudnn implementation (#73849)
Summary:
Pull Request resolved: #73849
This PR removes the int_repr() calls for the activation and weight tensors. Rather than using an int8 tensor, we use the qint8 tensor directly: the two tensors are fundamentally equivalent, except that the qint8 tensor also carries its quantization parameters. This avoids a copy of the qint8 tensor and significantly improves efficiency.
Test Plan:
In the pytorch main directory, execute
```
python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn
```
for accuracy testing and
```
python test/test_quantization.py TestQuantizedConv.test_benchmark
```
for benchmark testing. The previous and current int8 benchmark tables are identical to those shown above: quantized::conv2d CPU total drops from 2.410 s to 1.632 ms over 20 calls, and self CPU time total from 2.424 s to 18.116 ms.
Reviewed By: jerryzh168
Differential Revision: D34824248
Pulled By: dzdang
fbshipit-source-id: f1a558b50d1c9f8f30e1714d3a4667d929fc72ba
(cherry picked from commit e52ce62)
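To make the summary concrete, here is a small illustration of the copy that int_repr() incurs versus consuming the qint8 tensor directly. This is not PyTorch internals, only a sketch; the tensor shape and quantization parameters are arbitrary.
```
import torch

w = torch.randn(64, 3, 3, 3)
qw = torch.quantize_per_tensor(w, scale=0.1, zero_point=0, dtype=torch.qint8)

# Before this PR: int_repr() materializes a separate plain int8 tensor,
# i.e. an extra allocation and copy per conv call, before the data reaches cuDNN.
w_int8 = qw.int_repr()
assert w_int8.dtype == torch.int8

# After this PR: the qint8 tensor already holds the same int8 values; the
# scale/zero_point are just metadata, so the kernel can read qw's data directly.
print(qw.dtype, qw.q_scale(), qw.q_zero_point())
```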
Stack from ghstack (oldest at bottom):
Differential Revision: D34824248