
Conversation

@xuhdev (Collaborator) commented May 19, 2020

Stack from ghstack:

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                 (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup='import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Differential Revision: [D22291236](https://our.internmc.facebook.com/intern/diff/D22291236)
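For reference, the contract being benchmarked can be modeled in plain Python. This is a simplified sketch of `torch.arange`'s semantics only (the `arange` function below is a hypothetical stand-in, ignoring dtype and overflow handling; it is not the PyTorch implementation):

```python
import math

def arange(start, end, step=1):
    # Simplified model: result[i] = start + i * step
    # for 0 <= i < ceil((end - start) / step).
    n = max(0, math.ceil((end - start) / step))
    return [start + i * step for i in range(n)]

print(arange(0, 10, 3))  # [0, 3, 6, 9]
```

Because every element is a pure function of its index, the fill is a natural target for vectorization.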

xuhdev added a commit that referenced this pull request May 19, 2020
ghstack-source-id: 3266423
Pull Request resolved: #38697
@dr-ci bot commented May 19, 2020

💊 CI failures summary and remediations

As of commit c339042 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



xuhdev added a commit that referenced this pull request May 20, 2020
ghstack-source-id: cf85234
Pull Request resolved: #38697
@ailzhang ailzhang added the `triaged` label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 21, 2020
@gchanan gchanan removed their request for review June 2, 2020 16:52
@xuhdev xuhdev requested a review from ezyang June 9, 2020 18:34
@ezyang (Contributor) commented Jun 9, 2020

I notice that you have also serialized the kernel here, but the original was parallel. Can't the vectorized version also be parallelized?
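The parallelization point can be illustrated in pure Python: because element `i` equals `start + i * step` independently of its neighbors, the fill splits into chunks that can be computed concurrently. This is a toy sketch with hypothetical helper names, not PyTorch's `at::parallel_for`-based kernel:

```python
from concurrent.futures import ThreadPoolExecutor

def fill_chunk(out, start, step, lo, hi):
    # Each element depends only on its index, so chunks never interact.
    for i in range(lo, hi):
        out[i] = start + i * step

def parallel_arange(start, end, step=1, workers=4):
    n = max(0, -(-(end - start) // step))  # ceil division
    out = [0] * n
    chunk = max(1, -(-n // workers))
    with ThreadPoolExecutor(max_workers=workers) as ex:
        for lo in range(0, n, chunk):
            ex.submit(fill_chunk, out, start, step, lo, min(lo + chunk, n))
    return out  # exiting the with-block waits for all chunks
```

In principle a vectorized inner loop and chunk-level parallelism compose: each worker would run the vectorized kernel over its own slice.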

@ezyang (Contributor) commented Jun 11, 2020

OK, I guess this follows the precedent set by linspace. I'll give @VitalyFedyunin a chance to take a look.

xuhdev added a commit that referenced this pull request Jun 16, 2020
ghstack-source-id: 2847db5
Pull Request resolved: #38697
@xuhdev (Collaborator, Author) commented Jun 16, 2020

Rebased

@xuhdev (Collaborator, Author) commented Jun 24, 2020

@VitalyFedyunin Could you review this :)

@xuhdev (Collaborator, Author) commented Jun 29, 2020

This PR has been sitting here for more than one month... Could someone review this, please?

@VitalyFedyunin (Contributor) commented Jun 29, 2020

Sorry, it required some time to confirm my suspicion:

```python
o = torch.rand(3, 3, 300000).int()
o = o.permute(2, 0, 1)
torch.arange(0, 300000*3*3, out=o)
```

will end up with an incorrect result, compared, for example, to CUDA or the older code:

```python
o = torch.rand(3, 3, 300000).int().cuda()
o = o.permute(2, 0, 1)
torch.arange(0, 300000*3*3, out=o)
```

Just to be clear, we are looking at pre-allocated, non-contiguous output tensors.
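The failure mode can be modeled without PyTorch: a permuted view reorders strides, so the memory offset of logical element `i` is no longer `i`, and a kernel that writes `0, 1, 2, ...` straight into the underlying buffer in memory order reads back wrong through the view. A minimal sketch with simplified stride arithmetic (not PyTorch internals):

```python
def row_major_strides(shape):
    # Contiguous (row-major) strides, measured in elements.
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return strides

orig_shape = (2, 3)
orig_strides = row_major_strides(orig_shape)    # [3, 1]
shape = (orig_shape[1], orig_shape[0])          # permute -> (3, 2)
strides = (orig_strides[1], orig_strides[0])    # permute -> (1, 3)

storage = [0] * 6
# Correct strided fill: value i goes to the memory offset of logical index i.
for i in range(6):
    r, c = divmod(i, shape[1])
    storage[r * strides[0] + c * strides[1]] = i

print(storage)  # [0, 2, 4, 1, 3, 5] -- not the memory order [0..5]
```

A vectorized kernel that assumes contiguity would instead leave `[0, 1, 2, 3, 4, 5]` in storage, which appears to be what the reported example hits.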

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Differential Revision: [D22291236](https://our.internmc.facebook.com/intern/diff/D22291236)

[ghstack-poisoned]
Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Differential Revision: [D22291236](https://our.internmc.facebook.com/intern/diff/D22291236)

[ghstack-poisoned]
Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Differential Revision: [D22291236](https://our.internmc.facebook.com/intern/diff/D22291236)

[ghstack-poisoned]
Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```
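From the timings above, the per-case speedup is simply the Before time divided by the After time. A small stdlib-only sketch, not part of the PR itself; the two sample values are copied verbatim from the `double` and `uint8` rows for `torch.arange(0, 40000, ...)`:

```python
# Illustrative post-processing of the benchmark logs above; values are
# copied verbatim from the Before/After runs of torch.arange(0, 40000, ...).
before = {"double": 1.587841397995362, "uint8": 1.426058754004771}
after = {"double": 0.6895066159922862, "uint8": 0.3422072059911443}

for case in before:
    speedup = before[case] / after[case]  # > 1 means the PR is faster
    print(f"{case}: {speedup:.2f}x")  # double: 2.30x, uint8: 4.17x
```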

Differential Revision: [D22291236](https://our.internmc.facebook.com/intern/diff/D22291236)

@ngimel
Copy link
Collaborator

ngimel commented Jul 7, 2020

imo this example

```python
o = torch.rand(3, 3, 300000).int().cuda()
o = o.permute(2, 0, 1)
torch.arange(0, 300000*3*3, out=o)
```

where `o` is still a 3-D tensor even after the `arange` op, is a demonstration of incorrect resizing behavior for `out` args and should be changed. Issue to track resizing behavior: #41027
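The concern above can be phrased as a simple shape check: an `out` tensor passed to a 1-D factory like `arange` either already has the right 1-D shape or must be resized. A pure-Python sketch of that rule — the function `out_shape_ok` and its exact semantics are illustrative, not PyTorch's actual behavior:

```python
import math

def out_shape_ok(out_shape, numel):
    """Illustrative rule: an `out` arg for a 1-D range op is acceptable
    as-is only if it is already 1-D with exactly `numel` elements;
    anything else (e.g. the permuted 3-D tensor above) needs a resize."""
    return len(out_shape) == 1 and math.prod(out_shape) == numel

print(out_shape_ok((2700000,), 300000 * 3 * 3))  # True: already 1-D, right size
print(out_shape_ok((300000, 3, 3), 2700000))     # False: 3-D shape left untouched
```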

@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 10, 2020

#20342 might also be relevant.

xuhdev added a commit that referenced this pull request Jul 10, 2020
ghstack-source-id: 4d6b367
Pull Request resolved: #38697
xuhdev added a commit that referenced this pull request Jul 11, 2020
ghstack-source-id: e89fa8b
Pull Request resolved: #38697
@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 12, 2020

@VitalyFedyunin I added back the contiguity test that I had removed in this PR, and all tests pass now.
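For context on what a contiguity test checks: a C-contiguous (row-major) tensor has a fixed stride pattern derived from its shape. A stdlib-only sketch of that computation — the helper name is illustrative, not PyTorch code:

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for a given shape:
    the last dim has stride 1, and each earlier stride is the product
    of all later dim sizes."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

print(contiguous_strides((3, 3, 300000)))  # [900000, 300000, 1]
```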

xuhdev added a commit that referenced this pull request Jul 20, 2020

ghstack-source-id: 32f3386
Pull Request resolved: #38697
@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 20, 2020

@pytorchbot merge this please

@pytorchbot pytorchbot added the merge-this-please Was marked for merge with @pytorchbot merge this please label Jul 20, 2020
@xuhdev
Copy link
Collaborator Author

xuhdev commented Jul 29, 2020

Can we merge this now?

```cpp
TensorIterator it(iter);
cpu_serial_kernel_vec(
    it,
    [start, step, steps, &idx]() -> scalar_t {
```
Copy link
Contributor


Hello! Sorry, it fails internal hard linters with:

```
./aten/src/ATen/native/cpu/RangeFactoriesKernel.cpp:27:25: error: lambda capture 'steps' is not used [-Werror,-Wunused-lambda-capture]
```

Copy link
Collaborator Author


@VitalyFedyunin I've pushed an update to fix this, but the internal test still failed. Could you share the new error?

xuhdev added a commit that referenced this pull request Jul 29, 2020

ghstack-source-id: e85c19a
Pull Request resolved: #38697
@facebook-github-bot
Copy link
Contributor

@VitalyFedyunin merged this pull request in 34025eb.

@xuhdev xuhdev deleted the gh/xuhdev/78/head branch August 3, 2020 18:39
@xuhdev
Copy link
Collaborator Author

xuhdev commented Aug 3, 2020

Thank you @VitalyFedyunin !

peterbell10 added a commit that referenced this pull request Jul 2, 2021
Fixes gh-24571, fixes gh-24572
Closes gh-39586, closes gh-38697

Benchmarks
----------

The benchmarks were run with nvprof, calling the operator in a loop. They show
reliable improvements for large tensors, but the TH implementation seems to fare
better for smaller tensors. For sufficiently large tensors, though, the ATen
implementation does win.

|        Shape | Dim | Master Forward (us) | This PR Forward (us) | Master Backward (us) | This PR Backward (us) |
|-------------:|-----|:-------------------:|:--------------------:|:--------------------:|:---------------------:|
|    128, 1000 | 0   |        2.4770       |        2.0820        |        3.0440        |         3.4680        |
|              | 1   |        2.7060       |        4.4850        |        3.3380        |         3.6250        |
|   128, 10000 | 0   |        26.531       |        21.366        |        38.083        |         34.623        |
|              | 1   |        27.680       |        30.465        |        38.943        |         35.204        |
|  128, 100000 | 0   |        292.09       |        219.56        |        355.57        |         324.49        |
|              | 1   |        260.43       |        243.08        |        332.25        |         323.37        |
| 128, 1000000 | 0   |        2475.7       |        1874.6        |        3810.1        |         3215.7        |
|              | 1   |        2586.3       |        2380.9        |        3349.9        |         3207.8        |

peterbell10 added a commit that referenced this pull request Jul 2, 2021

ghstack-source-id: 197f4f7
Pull Request resolved: #61153

Labels

merge-this-please: Was marked for merge with @pytorchbot merge this please
Merged
open source
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
