Conversation

@zasdfgbnm commented Oct 20, 2019

Stack from ghstack:

Fixes: #26401

This PR fixes the issue by using the newly added dynamic cast inside
`TensorIterator`: instead of converting the type at the beginning
(which generates extra kernel launches), the `TensorIterator` does a
load-cast-compute-store for each element while looping, so there is only
one read and one write of memory.
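
For intuition, here is a minimal sketch of the difference (illustrative C++ only, not the actual `TensorIterator` code): the old path materializes a converted copy of the input before the arithmetic kernel runs, while the new path casts each element in registers inside the single fused loop.

```cpp
#include <cstdint>

// Before (conceptually): a separate cast pass writes a temporary, then the
// add loop reads it again -- an extra read and write of every element.
void add_with_upfront_cast(float* r, const double* d, float* tmp, int64_t n) {
  for (int64_t i = 0; i < n; i++) tmp[i] = static_cast<float>(d[i]);
  for (int64_t i = 0; i < n; i++) r[i] += tmp[i];
}

// After (conceptually): load, cast, compute, store per element in one loop,
// so each operand is read and written exactly once.
void add_with_dynamic_cast(float* r, const double* d, int64_t n) {
  for (int64_t i = 0; i < n; i++) {
    r[i] += static_cast<float>(d[i]);  // cast happens in registers
  }
}
```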

**nvprof:**

```python
import torch

_100M = 100 * 1024 ** 2
r = torch.randn(_100M, dtype=torch.float32, device='cuda')
d = torch.randn(_100M, dtype=torch.float64, device='cuda')
torch.cuda.synchronize()
torch.cuda.profiler.start()
r.add_(d)
torch.cuda.profiler.stop()
torch.cuda.synchronize()
```

```
==11407== NVPROF is profiling process 11407, command: /home/xgao/anaconda3/bin/python simple.py
==11407== Profiling application: /home/xgao/anaconda3/bin/python simple.py
==11407== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.0611ms         1  2.0611ms  2.0611ms  2.0611ms  _ZN2at6native18elementwise_kernelILi512ELi1EZNS0_15gpu_kernel_implIZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE1_clEvEUlddE_EEvS4_RKT_EUliE_EEviT1_
      API calls:  100.00%  1.05006s         1  1.05006s  1.05006s  1.05006s  cudaLaunchKernel
                    0.00%  2.7740us         2  1.3870us     673ns  2.1010us  cudaGetDevice
                    0.00%  2.3730us         1  2.3730us  2.3730us  2.3730us  cudaSetDevice
                    0.00%     830ns         1     830ns     830ns     830ns  cudaGetLastError
```

**benchmark**

```python
import torch
print(torch.__version__)
print(torch.version.git_version)

_100M = 100 * 1024 ** 2
r = torch.randn(_100M, dtype=torch.float32, device='cuda')
d = torch.randn(_100M, dtype=torch.float64, device='cuda')
torch.cuda.synchronize()
%timeit r.add_(d); torch.cuda.synchronize()
```

original

```
1.4.0a0+7d277b0
7d277b0670eb1f9098a7e098e93b20453e8b5c9f
6.83 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after

```
1.4.0a0+f0f2f65
f0f2f654cba9b8c569f0bcd583732bbc891f80b2
2.08 ms ± 139 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

zasdfgbnm added a commit that referenced this pull request Oct 20, 2019
@zasdfgbnm added the module: cpu, module: cuda, module: internals, module: operators, module: performance, and module: type promotion labels Oct 20, 2019
@zasdfgbnm mentioned this pull request Oct 21, 2019
constexpr int ntensors = traits::arity + 1;

// Copying strides to temporary array helps auto vectorization in older GCC
// versions.
Collaborator

which gcc versions need this? Note that gcc 5 is no longer supported, so workarounds for it are not necessary.

Collaborator Author

I don't know. It was copy-pasted from the existing code and modified. Let me try to find the answer.

Collaborator Author

I searched GCC's changelogs (for example https://gcc.gnu.org/gcc-7/changes.html) for "vectoriz" across different GCC versions. The only thing that looks relevant to me is at https://gcc.gnu.org/gcc-5/changes.html, but I don't know if it is related.

@colesbury might know the answer.
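
For reference, the pattern under discussion looks roughly like the following hedged sketch (a hypothetical standalone function, not the PyTorch inner loop): copying the strides into a fixed-size local array gives the compiler loop-invariant values with a compile-time bound, which is the property said to help older GCC releases auto-vectorize the element loop.

```cpp
#include <cstdint>

// Hedged illustration: strided element-wise add over three tensors
// (out, a, b), with the strides copied to a local array first.
void strided_add_float(char** data, const int64_t* strides_in, int64_t n) {
  constexpr int ntensors = 3;
  int64_t strides[ntensors];
  for (int arg = 0; arg < ntensors; arg++) {
    strides[arg] = strides_in[arg];  // local copy: loop-invariant, known size
  }
  for (int64_t i = 0; i < n; i++) {
    float* out = reinterpret_cast<float*>(data[0] + i * strides[0]);
    const float* a = reinterpret_cast<const float*>(data[1] + i * strides[1]);
    const float* b = reinterpret_cast<const float*>(data[2] + i * strides[2]);
    *out = *a + *b;
  }
}
```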

@VitalyFedyunin
Contributor

Please also include CPU benchmarks with and without type promotions

@zasdfgbnm
Collaborator Author

@VitalyFedyunin There is very little change in performance.

The benchmark is as follows:

```python
import torch
print(torch.__version__)
print(torch.version.git_version)

_100M = 100 * 1024 ** 2
r = torch.randn(_100M, dtype=torch.float32, device='cpu')
d = torch.randn(_100M, dtype=torch.float64, device='cpu')
%timeit r.add_(d);
```

before

```
1.4.0a0+f6c0a89
f6c0a89acc929427e2e02a8215e7692455271178
129 ms ± 5.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

after

```
1.4.0a0+1c1c778
1c1c778b69e7473f2ddb6dbf69c8e626f33831ff
121 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

@zasdfgbnm
Collaborator Author

Without promotion on CPU:

```python
import torch
print(torch.__version__)
print(torch.version.git_version)

_100M = 100 * 1024 ** 2
a = torch.randn(_100M, dtype=torch.float32, device='cpu')
b = torch.randn(_100M, dtype=torch.float32, device='cpu')
%timeit a.add_(b);
```

before

```
1.4.0a0+f6c0a89
f6c0a89acc929427e2e02a8215e7692455271178
23.3 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

after

```
1.4.0a0+1c1c778
1c1c778b69e7473f2ddb6dbf69c8e626f33831ff
23.1 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

@zasdfgbnm
Collaborator Author

Without promotion on GPU:

```python
import torch
print(torch.__version__)
print(torch.version.git_version)

_100M = 100 * 1024 ** 2
a = torch.randn(_100M, dtype=torch.float32, device='cuda')
b = torch.randn(_100M, dtype=torch.float32, device='cuda')
torch.cuda.synchronize()
%timeit a.add_(b); torch.cuda.synchronize()
```

before

```
1.4.0a0+f6c0a89
f6c0a89acc929427e2e02a8215e7692455271178
1.53 ms ± 455 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

after

```
1.4.0a0+1c1c778
1c1c778b69e7473f2ddb6dbf69c8e626f33831ff
1.54 ms ± 273 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

@zasdfgbnm
Collaborator Author

The case where there is no promotion is dispatched to the original code at
https://github.com/pytorch/pytorch/pull/28344/files#diff-0d1178f1a4ce15aeb760d251974e6924R242
and
https://github.com/pytorch/pytorch/pull/28344/files#diff-baaaf6b9adceeef7f820bc73f6e49a44R161,
so there shouldn't be any performance change.
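
In other words, the casting path is only taken when some operand's dtype differs from the compute dtype; a hedged sketch of that dispatch (hypothetical names, not the real `TensorIterator` API):

```cpp
enum class ScalarType { Float, Double };

struct Operand {
  ScalarType dtype;         // dtype of the tensor's storage
  ScalarType target_dtype;  // dtype the kernel computes in
};

// Returns true if any operand needs a per-element cast.
inline bool needs_dynamic_casting(const Operand* ops, int n) {
  for (int i = 0; i < n; i++) {
    if (ops[i].dtype != ops[i].target_dtype) return true;
  }
  return false;
}

// Picks the original fast loop when no cast is needed, so the no-promotion
// case runs exactly the code it ran before.
template <typename LoopNoCast, typename LoopWithCast>
void run_loop(const Operand* ops, int n, LoopNoCast fast, LoopWithCast casting) {
  if (!needs_dynamic_casting(ops, n)) {
    fast();
  } else {
    casting();
  }
}
```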

@zasdfgbnm merged commit a3a32ff into gh/zasdfgbnm/9/base Oct 22, 2019
@zasdfgbnm
Collaborator Author

I messed up the PRs and they were merged by ghstack... Will resubmit soon.

zasdfgbnm added a commit that referenced this pull request Oct 22, 2019
zasdfgbnm added a commit that referenced this pull request Oct 22, 2019
zasdfgbnm added a commit that referenced this pull request Oct 23, 2019
zasdfgbnm added a commit that referenced this pull request Oct 24, 2019
zasdfgbnm added a commit that referenced this pull request Oct 25, 2019
zasdfgbnm added a commit that referenced this pull request Oct 26, 2019
zasdfgbnm added a commit that referenced this pull request Oct 26, 2019
facebook-github-bot pushed a commit that referenced this pull request Oct 28, 2019
Summary:
Pull Request resolved: #28427

Fixes: #26401

Test Plan: Imported from OSS

Differential Revision: D18170997

Pulled By: ezyang

fbshipit-source-id: 9c82c1c89583f3e6202c5d790b9b73ad9f960fad
zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 28, 2019