
Conversation

@xuhdev (Collaborator) commented May 19, 2020

Stack from ghstack:

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                 (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup='import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Differential Revision: [D22291236](https://our.internmc.facebook.com/intern/diff/D22291236)
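For reference, the contract being benchmarked can be modeled in plain Python. This is a simplified sketch of `torch.arange`'s semantics only (the `arange` function below is a hypothetical stand-in, ignoring dtype and overflow handling; it is not the PyTorch implementation):

```python
import math

def arange(start, end, step=1):
    # Simplified model: result[i] = start + i * step
    # for 0 <= i < ceil((end - start) / step).
    n = max(0, math.ceil((end - start) / step))
    return [start + i * step for i in range(n)]

print(arange(0, 10, 3))  # [0, 3, 6, 9]
```

Because every element is a pure function of its index, the fill is a natural target for vectorization.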

xuhdev added a commit that referenced this pull request May 19, 2020
ghstack-source-id: 3266423
Pull Request resolved: #38697
@dr-ci bot commented May 19, 2020

💊 CI failures summary and remediations

As of commit c339042 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



xuhdev added a commit that referenced this pull request May 20, 2020
ghstack-source-id: cf85234
Pull Request resolved: #38697
@ailzhang ailzhang added the `triaged` label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 21, 2020
@gchanan gchanan removed their request for review June 2, 2020 16:52
@xuhdev xuhdev requested a review from ezyang June 9, 2020 18:34
@ezyang (Contributor) commented Jun 9, 2020

I notice that you have also serialized the kernel here, but the original was parallel. Can't the vectorized version also be parallelized?
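The parallelization point can be illustrated in pure Python: because element `i` equals `start + i * step` independently of its neighbors, the fill splits into chunks that can be computed concurrently. This is a toy sketch with hypothetical helper names, not PyTorch's `at::parallel_for`-based kernel:

```python
from concurrent.futures import ThreadPoolExecutor

def fill_chunk(out, start, step, lo, hi):
    # Each element depends only on its index, so chunks never interact.
    for i in range(lo, hi):
        out[i] = start + i * step

def parallel_arange(start, end, step=1, workers=4):
    n = max(0, -(-(end - start) // step))  # ceil division
    out = [0] * n
    chunk = max(1, -(-n // workers))
    with ThreadPoolExecutor(max_workers=workers) as ex:
        for lo in range(0, n, chunk):
            ex.submit(fill_chunk, out, start, step, lo, min(lo + chunk, n))
    return out  # exiting the with-block waits for all chunks
```

In principle a vectorized inner loop and chunk-level parallelism compose: each worker would run the vectorized kernel over its own slice.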

@ezyang (Contributor) commented Jun 11, 2020

OK, I guess this follows the precedent set by linspace. I'll give @VitalyFedyunin a chance to take a look.

xuhdev added a commit that referenced this pull request Jun 16, 2020
ghstack-source-id: 2847db5
Pull Request resolved: #38697
@xuhdev (Collaborator, Author) commented Jun 16, 2020

Rebased

@xuhdev (Collaborator, Author) commented Jun 24, 2020

@VitalyFedyunin Could you review this :)

@xuhdev (Collaborator, Author) commented Jun 29, 2020

This PR has been sitting here for more than one month... Could someone review this, please?

@VitalyFedyunin (Contributor) commented Jun 29, 2020

Sorry, it required some time to confirm my suspicion:

```python
o = torch.rand(3, 3, 300000).int()
o = o.permute(2, 0, 1)
torch.arange(0, 300000*3*3, out=o)
```

will end up with an incorrect result, compared, for example, to CUDA or the older code:

```python
o = torch.rand(3, 3, 300000).int().cuda()
o = o.permute(2, 0, 1)
torch.arange(0, 300000*3*3, out=o)
```

Just to be clear, we are looking at pre-allocated, non-contiguous output tensors.
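The failure mode can be modeled without PyTorch: a permuted view reorders strides, so the memory offset of logical element `i` is no longer `i`, and a kernel that writes `0, 1, 2, ...` straight into the underlying buffer in memory order reads back wrong through the view. A minimal sketch with simplified stride arithmetic (not PyTorch internals):

```python
def row_major_strides(shape):
    # Contiguous (row-major) strides, measured in elements.
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return strides

orig_shape = (2, 3)
orig_strides = row_major_strides(orig_shape)    # [3, 1]
shape = (orig_shape[1], orig_shape[0])          # permute -> (3, 2)
strides = (orig_strides[1], orig_strides[0])    # permute -> (1, 3)

storage = [0] * 6
# Correct strided fill: value i goes to the memory offset of logical index i.
for i in range(6):
    r, c = divmod(i, shape[1])
    storage[r * strides[0] + c * strides[1]] = i

print(storage)  # [0, 2, 4, 1, 3, 5] -- not the memory order [0..5]
```

A vectorized kernel that assumes contiguity would instead leave `[0, 1, 2, 3, 4, 5]` in storage, which appears to be what the reported example hits.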

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Differential Revision: [D22291236](https://our.internmc.facebook.com/intern/diff/D22291236)

[ghstack-poisoned]
Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Differential Revision: [D22291236](https://our.internmc.facebook.com/intern/diff/D22291236)

[ghstack-poisoned]
Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Differential Revision: [D22291236](https://our.internmc.facebook.com/intern/diff/D22291236)

[ghstack-poisoned]
Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```
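From the timings above, the per-case speedup is simply the Before time divided by the After time. A small stdlib-only sketch, not part of the PR itself; the two sample values are copied verbatim from the `double` and `uint8` rows for `torch.arange(0, 40000, ...)`:

```python
# Illustrative post-processing of the benchmark logs above; values are
# copied verbatim from the Before/After runs of torch.arange(0, 40000, ...).
before = {"double": 1.587841397995362, "uint8": 1.426058754004771}
after = {"double": 0.6895066159922862, "uint8": 0.3422072059911443}

for case in before:
    speedup = before[case] / after[case]  # > 1 means the PR is faster
    print(f"{case}: {speedup:.2f}x")  # double: 2.30x, uint8: 4.17x
```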

Differential Revision: [D22291236](https://our.internmc.facebook.com/intern/diff/D22291236)

@ngimel
Copy link
Collaborator

ngimel commented Jul 7, 2020

imo this example

```python
o = torch.rand(3, 3, 300000).int().cuda()
o = o.permute(2, 0, 1)
torch.arange(0, 300000*3*3, out=o)
```

where `o` is still a 3-D tensor even after the `arange` op, is a demonstration of incorrect resizing behavior for `out` args and should be changed. Issue to track resizing behavior: #41027
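The concern above can be phrased as a simple shape check: an `out` tensor passed to a 1-D factory like `arange` either already has the right 1-D shape or must be resized. A pure-Python sketch of that rule — the function `out_shape_ok` and its exact semantics are illustrative, not PyTorch's actual behavior:

```python
import math

def out_shape_ok(out_shape, numel):
    """Illustrative rule: an `out` arg for a 1-D range op is acceptable
    as-is only if it is already 1-D with exactly `numel` elements;
    anything else (e.g. the permuted 3-D tensor above) needs a resize."""
    return len(out_shape) == 1 and math.prod(out_shape) == numel

print(out_shape_ok((2700000,), 300000 * 3 * 3))  # True: already 1-D, right size
print(out_shape_ok((300000, 3, 3), 2700000))     # False: 3-D shape left untouched
```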

@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 10, 2020

#20342 might also be relevant.

xuhdev added a commit that referenced this pull request Jul 10, 2020
ghstack-source-id: 4d6b367
Pull Request resolved: #38697
xuhdev added a commit that referenced this pull request Jul 11, 2020
ghstack-source-id: e89fa8b
Pull Request resolved: #38697
@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 12, 2020

@VitalyFedyunin I added back the contiguity test that I had removed in this PR, and all tests pass now.
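For context on what a contiguity test checks: a C-contiguous (row-major) tensor has a fixed stride pattern derived from its shape. A stdlib-only sketch of that computation — the helper name is illustrative, not PyTorch code:

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for a given shape:
    the last dim has stride 1, and each earlier stride is the product
    of all later dim sizes."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

print(contiguous_strides((3, 3, 300000)))  # [900000, 300000, 1]
```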

xuhdev added a commit that referenced this pull request Jul 20, 2020

ghstack-source-id: 32f3386
Pull Request resolved: #38697
@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 20, 2020

@pytorchbot merge this please

@pytorchbot pytorchbot added the merge-this-please Was marked for merge with @pytorchbot merge this please label Jul 20, 2020
@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 29, 2020

Can we merge this now?

```cpp
TensorIterator it(iter);
cpu_serial_kernel_vec(
    it,
    [start, step, steps, &idx]() -> scalar_t {
```
Copy link
Contributor


Hello! Sorry, it fails internal hard linters with:

```
./aten/src/ATen/native/cpu/RangeFactoriesKernel.cpp:27:25: error: lambda capture 'steps' is not used [-Werror,-Wunused-lambda-capture]
```

Copy link
Collaborator Author


@VitalyFedyunin I've pushed an update to fix this, but the internal test still failed. Could you share the new error?

xuhdev added a commit that referenced this pull request Jul 29, 2020

ghstack-source-id: e85c19a
Pull Request resolved: #38697
@facebook-github-bot
Copy link
Contributor

@VitalyFedyunin merged this pull request in 34025eb.

@xuhdev xuhdev deleted the gh/xuhdev/78/head branch August 3, 2020 18:39
@xuhdev
Copy link
Collaborator Author

xuhdev commented Aug 3, 2020

Thank you @VitalyFedyunin !

peterbell10 added a commit that referenced this pull request Jul 2, 2021
Fixes gh-24571, fixes gh-24572
Closes gh-39586, closes gh-38697

Benchmarks
----------

The benchmarks were run with nvprof, calling the operator in a loop. They show
reliable improvements for large tensors, but the TH implementation seems to fare
better for smaller tensors. For sufficiently large tensors, though, the ATen
implementation does win.

|        Shape | Dim | Master Forward (us) | This PR Forward (us) | Master Backward (us) | This PR Backward (us) |
|-------------:|-----|:-------------------:|:--------------------:|:--------------------:|:---------------------:|
|    128, 1000 | 0   |        2.4770       |        2.0820        |        3.0440        |         3.4680        |
|              | 1   |        2.7060       |        4.4850        |        3.3380        |         3.6250        |
|   128, 10000 | 0   |        26.531       |        21.366        |        38.083        |         34.623        |
|              | 1   |        27.680       |        30.465        |        38.943        |         35.204        |
|  128, 100000 | 0   |        292.09       |        219.56        |        355.57        |         324.49        |
|              | 1   |        260.43       |        243.08        |        332.25        |         323.37        |
| 128, 1000000 | 0   |        2475.7       |        1874.6        |        3810.1        |         3215.7        |
|              | 1   |        2586.3       |        2380.9        |        3349.9        |         3207.8        |

peterbell10 added a commit that referenced this pull request Jul 2, 2021

ghstack-source-id: 197f4f7
Pull Request resolved: #61153

Labels

merge-this-please: Was marked for merge with @pytorchbot merge this please
Merged
open source
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
