Vectorize int8_t on CPU #44759
Conversation
Does this also apply to unsigned `uint8_t`, or are separate code changes needed?
The plan is to do uint8 in a separate PR, because dealing with uint8 is likely very different from int8 (e.g., availability of intrinsic instructions, unsignedness, etc.). This PR is already large enough, and it makes sense to break things down into two PRs.
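For illustration, a minimal sketch (hypothetical, not code from either PR) of one such intrinsic asymmetry: AVX2 provides a signed 8-bit greater-than compare but no unsigned counterpart, so a uint8 path would need a sign-bias workaround.

```cpp
#include <immintrin.h>

// Signed int8 compare: a single AVX2 intrinsic does the job.
__m256i gt_int8(__m256i a, __m256i b) {
  return _mm256_cmpgt_epi8(a, b);
}

// Unsigned uint8 compare: no _mm256_cmpgt_epu8 exists, so flip the sign
// bit of both operands to map unsigned order onto signed order first.
__m256i gt_uint8(__m256i a, __m256i b) {
  const __m256i bias = _mm256_set1_epi8(static_cast<char>(0x80));
  return _mm256_cmpgt_epi8(_mm256_xor_si256(a, bias),
                           _mm256_xor_si256(b, bias));
}
```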
Codecov Report

```
@@           Coverage Diff           @@
##           master   #44759   +/-  ##
=======================================
  Coverage   67.83%   67.83%
=======================================
  Files         384      384
  Lines       49962    49962
=======================================
+ Hits        33892    33894      +2
+ Misses      16070    16068      -2
```

Continue to review the full report at Codecov.
Very nice! I'll let Vitaly or Xinyu look at this for now, but holler if you need unblocking.
glaringlee left a comment:
@xuhdev I left a few small comments; looks good to me otherwise. cc @VitalyFedyunin
nit: It seems this is not changed in this PR, but I'm just curious why we use `!=` here instead of `<`. @VitalyFedyunin @ezyang
This simply mirrors other integer vec256 classes. @VitalyFedyunin and @ezyang might know the reason.
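For context, a minimal sketch (hypothetical names, not the actual kernel) of why the two conditions behave the same in this loop shape: the trip count is rounded down to a multiple of the vector width before the loop, so the index steps exactly onto the bound and `!=` and `<` terminate identically.

```cpp
#include <cstddef>
#include <cstdint>

void add_int8_loop(int8_t* out, const int8_t* a, const int8_t* b, size_t n) {
  constexpr size_t kWidth = 32;         // int8 lanes in a 256-bit vector
  const size_t n_vec = n - n % kWidth;  // largest multiple of kWidth <= n
  // i only ever takes the values 0, kWidth, 2*kWidth, ..., n_vec, so
  // `i != n_vec` and `i < n_vec` are interchangeable here.
  for (size_t i = 0; i != n_vec; i += kWidth) {
    // ... vectorized load / add / store of 32 elements ...
  }
  for (size_t i = n_vec; i < n; ++i) {  // scalar tail for the remainder
    out[i] = static_cast<int8_t>(a[i] + b[i]);
  }
}
```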
Should these two lines be `__at_align32__`?
Yes, I've added them here. I think the bigger danger is that `storeu` as defined for the integer Vec256 classes requires the parameter to be aligned (e.g., see `pytorch/aten/src/ATen/cpu/vec256/vec256_int.h`, line 452 at commit 3e6bb52):

```cpp
_mm256_storeu_si256(reinterpret_cast<__m256i*>(ptr), values);
```
@xuhdev Yep, this is exactly the place I looked at.
Let's leave it as-is in this PR and open a new issue specifically for this unprotected behavior. Please cc @VitalyFedyunin @ezyang.
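To make the distinction under discussion concrete, a minimal standalone sketch (not PyTorch's actual code; `alignas(32)` stands in for `__at_align32__`) of aligned versus unaligned 256-bit stores:

```cpp
#include <immintrin.h>
#include <cstdint>

void store_example(__m256i values) {
  alignas(32) int8_t aligned_buf[32];  // what __at_align32__ guarantees
  int8_t plain_buf[32];                // alignment unspecified beyond 1

  // Aligned store: undefined behavior unless the destination pointer
  // is 32-byte aligned.
  _mm256_store_si256(reinterpret_cast<__m256i*>(aligned_buf), values);

  // Unaligned store: legal for any destination, possibly slightly slower.
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(plain_buf), values);
}
```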
💊 CI failures summary and remediations

As of commit 1fe4d7a (more details on the Dr. CI page): ci.pytorch.org: 1 failed.

This comment was automatically generated by Dr. CI and has been revised 8 times.
facebook-github-bot left a comment:
@glaringlee has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@xuhdev Can you please do a rebase? I think this is good to land now.
int8_t is not vectorized in vec256_int.h. This PR adds vectorization for int8_t. As pointed out in #43033, this is an important type for vectorization because a lot of images are loaded in this data type. Related issue: pytorch#43033

Benchmark (Debian Buster, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Turbo off, Release build):

```python
import timeit

dtype = 'torch.int8'
for op in ('+', '-'):
    for n, t in [(10_000, 200000), (100_000, 20000)]:
        print(f'a {op} b, numel() == {n} for {t} times, dtype={dtype}')
        print(timeit.timeit(
            f'c = a {op} b',
            setup=f'import torch; a = torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1, dtype={dtype})',
            number=t))
```

Results:

Before:

```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
1.2223373489978258
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6108450189931318
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
1.256775538000511
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6101213909860235
```

After:

```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5713336059998255
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.39169703199877404
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5838428330025636
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.37486923701362684
```
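For a sense of where the roughly 2x speedup comes from, a minimal sketch (an illustration under AVX2, not the PR's actual implementation) of an int8 addition kernel that processes 32 elements per instruction:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void vec_add_int8(int8_t* out, const int8_t* a, const int8_t* b, size_t n) {
  size_t i = 0;
  // Main loop: 32 int8 additions per _mm256_add_epi8.
  for (; i + 32 <= n; i += 32) {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                        _mm256_add_epi8(va, vb));
  }
  // Scalar tail for the last n % 32 elements.
  for (; i < n; ++i) {
    out[i] = static_cast<int8_t>(a[i] + b[i]);
  }
}
```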
`++ i` --> `++i`
@glaringlee Rebased, thanks.
facebook-github-bot left a comment:
@glaringlee has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
glaringlee left a comment:
LGTM now.
@glaringlee merged this pull request in 4b3046e.