xuhdev (Collaborator) commented Sep 15, 2020

int8_t is not vectorized in vec256_int.h. This PR adds vectorization for
int8_t. As pointed out in #43033, this is an important type for vectorization because
many images are loaded in this data type.

Related issue: #43033

Benchmark (Debian Buster, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Turbo off, Release build):

```python
import timeit
dtype = 'torch.int8'
for op in ('+', '-'):
    for n, t in [(10_000, 200000),
                 (100_000, 20000)]:
        print(f'a {op} b, numel() == {n} for {t} times, dtype={dtype}')
        print(timeit.timeit(f'c = a {op} b', setup=f'import torch; a = torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1, dtype={dtype})', number=t))
```

Results:

Before:

```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
1.2223373489978258
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6108450189931318
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
1.256775538000511
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6101213909860235
```

After:

```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5713336059998255
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.39169703199877404
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5838428330025636
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.37486923701362684
```

vadimkantorov (Contributor) commented Sep 16, 2020

Does it also apply to unsigned uint8_t, or are separate code changes needed?

xuhdev (Collaborator, author) commented Sep 16, 2020

The plan is to handle uint8 in a separate PR, because dealing with uint8 is likely quite different from int8 (e.g., availability of intrinsic instructions, unsignedness, etc.). This PR is already large enough, and it makes sense to break the work into two PRs.

codecov bot commented Sep 16, 2020

Codecov Report

Merging #44759 into master will increase coverage by 0.00%.
The diff coverage is n/a.


```
@@           Coverage Diff           @@
##           master   #44759   +/-   ##
=======================================
  Coverage   67.83%   67.83%           
=======================================
  Files         384      384           
  Lines       49962    49962           
=======================================
+ Hits        33892    33894    +2     
+ Misses      16070    16068    -2     
```

| Impacted Files | Coverage Δ |
| --- | --- |
| torch/utils/_benchmark/utils/common.py | 79.33% <0.00%> (+1.65%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d75c402...1fe4d7a.

@ngimel ngimel requested review from VitalyFedyunin and glaringlee and removed request for EscapeZero, albanD and mruberry September 16, 2020 03:15
@ngimel ngimel added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Sep 16, 2020
ezyang (Contributor) commented Sep 16, 2020

Very nice! I'll let Vitaly or Xinyu look at this for now, but holler if you need unblocking.

glaringlee (Contributor) left a comment:

@xuhdev Left a few minor comments; looks good to me otherwise. cc @VitalyFedyunin

glaringlee (Contributor) commented:

nit: It seems this is not changed in this PR, but I'm just curious why we use != here instead of <. @VitalyFedyunin @ezyang

xuhdev (Collaborator, author) commented:

This simply mirrors other integer vec256 classes. @VitalyFedyunin and @ezyang might know the reason.

glaringlee (Contributor) commented:

Should these 2 lines be __at_align32__?

xuhdev (Collaborator, author) commented:

Yes, I've added them here. I think the bigger danger is that storeu, as defined for the integer Vec256 classes, requires the parameter to be aligned (e.g., see

    _mm256_storeu_si256(reinterpret_cast<__m256i*>(ptr), values);

), but there is no static check in the code. Do you agree? If so, I will open an issue and find an opportunity to fix this unprotected behavior.

glaringlee (Contributor) commented Sep 16, 2020:

@xuhdev
Yep, this is exactly the place I've looked at.
Let's leave it for this PR and open a new issue specifically for this unprotected behavior. Please cc @VitalyFedyunin @ezyang .

dr-ci bot commented Sep 16, 2020

💊 CI failures summary and remediations

As of commit 1fe4d7a (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



facebook-github-bot (Contributor) left a comment:

@glaringlee has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

glaringlee (Contributor) commented:

@xuhdev Can you please do a rebase? I think this is good to land now.

xuhdev (Collaborator, author) commented Sep 19, 2020

@glaringlee Rebased, thanks

@xuhdev xuhdev requested a review from glaringlee September 19, 2020 05:59
facebook-github-bot (Contributor) left a comment:

@glaringlee has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

glaringlee (Contributor) left a comment:

LGTM now.

@xuhdev xuhdev deleted the vec-int8 branch September 22, 2020 03:20
facebook-github-bot (Contributor) commented:
@glaringlee merged this pull request in 4b3046e.

Labels: Merged, open source, triaged

8 participants