Vectorize int8_t on CPU #44759
Conversation
Does this also apply to unsigned `uint8_t`, or are separate code changes needed?
The plan is to do uint8 in a separate PR, because dealing with uint8 is likely very different from int8 (e.g., availability of intrinsic instructions, unsignedness, etc.). This PR is already large enough, and it makes sense to break things down into two PRs.
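For illustration, a minimal sketch (hypothetical, not code from either PR) of one such intrinsic asymmetry: AVX2 provides a signed 8-bit greater-than compare but no unsigned counterpart, so a uint8 path would need a sign-bias workaround.

```cpp
#include <immintrin.h>

// Signed int8 compare: a single AVX2 intrinsic does the job.
__m256i gt_int8(__m256i a, __m256i b) {
  return _mm256_cmpgt_epi8(a, b);
}

// Unsigned uint8 compare: no _mm256_cmpgt_epu8 exists, so flip the sign
// bit of both operands to map unsigned order onto signed order first.
__m256i gt_uint8(__m256i a, __m256i b) {
  const __m256i bias = _mm256_set1_epi8(static_cast<char>(0x80));
  return _mm256_cmpgt_epi8(_mm256_xor_si256(a, bias),
                           _mm256_xor_si256(b, bias));
}
```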
Codecov Report

```
@@           Coverage Diff           @@
##           master   #44759   +/-  ##
=======================================
  Coverage   67.83%   67.83%
=======================================
  Files         384      384
  Lines       49962    49962
=======================================
+ Hits        33892    33894      +2
+ Misses      16070    16068      -2
```

Continue to review the full report at Codecov.
Very nice! I'll let Vitaly or Xinyu look at this for now, but holler if you need unblocking.
glaringlee left a comment:
@xuhdev I left a few small comments; looks good to me otherwise. cc @VitalyFedyunin
nit: It seems this is not changed in this PR, but I'm just curious why we use `!=` here instead of `<`. @VitalyFedyunin @ezyang
This simply mirrors other integer vec256 classes. @VitalyFedyunin and @ezyang might know the reason.
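For context, a minimal sketch (hypothetical names, not the actual kernel) of why the two conditions behave the same in this loop shape: the trip count is rounded down to a multiple of the vector width before the loop, so the index steps exactly onto the bound and `!=` and `<` terminate identically.

```cpp
#include <cstddef>
#include <cstdint>

void add_int8_loop(int8_t* out, const int8_t* a, const int8_t* b, size_t n) {
  constexpr size_t kWidth = 32;         // int8 lanes in a 256-bit vector
  const size_t n_vec = n - n % kWidth;  // largest multiple of kWidth <= n
  // i only ever takes the values 0, kWidth, 2*kWidth, ..., n_vec, so
  // `i != n_vec` and `i < n_vec` are interchangeable here.
  for (size_t i = 0; i != n_vec; i += kWidth) {
    // ... vectorized load / add / store of 32 elements ...
  }
  for (size_t i = n_vec; i < n; ++i) {  // scalar tail for the remainder
    out[i] = static_cast<int8_t>(a[i] + b[i]);
  }
}
```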
Should these two lines be `__at_align32__`?
Yes, I've added them here. I think the bigger danger is that `storeu` as defined for the integer Vec256 classes requires the parameter to be aligned (e.g., see `pytorch/aten/src/ATen/cpu/vec256/vec256_int.h`, line 452 at commit 3e6bb52):

```cpp
_mm256_storeu_si256(reinterpret_cast<__m256i*>(ptr), values);
```
@xuhdev Yep, this is exactly the place I looked at.
Let's leave it as-is in this PR and open a new issue specifically for this unprotected behavior. Please cc @VitalyFedyunin @ezyang.
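To make the distinction under discussion concrete, a minimal standalone sketch (not PyTorch's actual code; `alignas(32)` stands in for `__at_align32__`) of aligned versus unaligned 256-bit stores:

```cpp
#include <immintrin.h>
#include <cstdint>

void store_example(__m256i values) {
  alignas(32) int8_t aligned_buf[32];  // what __at_align32__ guarantees
  int8_t plain_buf[32];                // alignment unspecified beyond 1

  // Aligned store: undefined behavior unless the destination pointer
  // is 32-byte aligned.
  _mm256_store_si256(reinterpret_cast<__m256i*>(aligned_buf), values);

  // Unaligned store: legal for any destination, possibly slightly slower.
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(plain_buf), values);
}
```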
💊 CI failures summary and remediations

As of commit 1fe4d7a (more details on the Dr. CI page): ci.pytorch.org: 1 failed.

This comment was automatically generated by Dr. CI and has been revised 8 times.
facebook-github-bot left a comment:
@glaringlee has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@xuhdev Can you please do a rebase? I think this is good to land now.
int8_t is not vectorized in vec256_int.h. This PR adds vectorization for int8_t. As pointed out in #43033, this is an important type for vectorization because a lot of images are loaded in this data type. Related issue: pytorch#43033

Benchmark (Debian Buster, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Turbo off, Release build):

```python
import timeit

dtype = 'torch.int8'
for op in ('+', '-'):
    for n, t in [(10_000, 200000), (100_000, 20000)]:
        print(f'a {op} b, numel() == {n} for {t} times, dtype={dtype}')
        print(timeit.timeit(
            f'c = a {op} b',
            setup=f'import torch; a = torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1, dtype={dtype})',
            number=t))
```

Results:

Before:

```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
1.2223373489978258
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6108450189931318
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
1.256775538000511
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6101213909860235
```

After:

```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5713336059998255
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.39169703199877404
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5838428330025636
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.37486923701362684
```
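For a sense of where the roughly 2x speedup comes from, a minimal sketch (an illustration under AVX2, not the PR's actual implementation) of an int8 addition kernel that processes 32 elements per instruction:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void vec_add_int8(int8_t* out, const int8_t* a, const int8_t* b, size_t n) {
  size_t i = 0;
  // Main loop: 32 int8 additions per _mm256_add_epi8.
  for (; i + 32 <= n; i += 32) {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                        _mm256_add_epi8(va, vb));
  }
  // Scalar tail for the last n % 32 elements.
  for (; i < n; ++i) {
    out[i] = static_cast<int8_t>(a[i] + b[i]);
  }
}
```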
`++ i` --> `++i`
@glaringlee Rebased, thanks.
facebook-github-bot left a comment:
@glaringlee has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
glaringlee left a comment:
LGTM now.
@glaringlee merged this pull request in 4b3046e.