

@quickwritereader quickwritereader commented Jul 16, 2020

Pytorch Vec256 ppc64le support

implemented types:

  • double
  • float
  • int16
  • int32
  • int64
  • qint32
  • qint8
  • quint8
  • complex_float
  • complex_double

Notes:
All basic vector operations are implemented. There are a few problems:

  • NaN propagation for minimum/maximum on ppc64le is missing and was not checked
  • complex multiplication, division, sqrt and abs are implemented the same way as in PyTorch's x86 code; they can overflow and have worse precision than the std:: ones. That is why they were either excluded from testing or tested over a smaller domain range
  • the precision of the implemented float math functions is limited
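The missing NaN-propagation semantics can be shown with a scalar sketch (a hypothetical helper for illustration, not the Vec256 code itself): PyTorch expects minimum/maximum to return NaN whenever either input is NaN, which plain std::min/std::max do not guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Scalar sketch of a NaN-propagating minimum (hypothetical helper, not the
// actual Vec256 member): if either operand is NaN the result must be NaN.
// std::min(a, NaN) would instead return a, because the NaN comparison is
// false -- exactly the behaviour difference left unchecked on ppc64le.
double propagating_min(double a, double b) {
  if (std::isnan(a) || std::isnan(b)) {
    return std::numeric_limits<double>::quiet_NaN();
  }
  return std::min(a, b);
}
```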

In addition, I added a CPU_CAPABILITY for Power, but because of quantization errors with DEFAULT I had to undef it and use VSX for DEFAULT as well.

Details

Supported math functions
  • + (plus) means vectorized; - (minus) means missing; implementation notes are added in parentheses. Example: -(both) means it is also missing on the x86 side
  • f(func_name) means the vectorization is implemented via func_name
  • sleef means redirected to the Sleef library
  • unsupported means not supported for that type
| function_name | float | double | complex float | complex double |
| --- | --- | --- | --- | --- |
| acos | sleef | sleef | f(asin) | f(asin) |
| asin | sleef | sleef | +(pytorch impl) | +(pytorch impl) |
| atan | sleef | sleef | f(log) | f(log) |
| atan2 | sleef | sleef | unsupported | unsupported |
| cos | +(ppc64le: avx_mathfun) | sleef | -(both) | -(both) |
| cosh | f(exp) | | -(both) | -(both) |
| erf | sleef | sleef | unsupported | unsupported |
| erfc | sleef | sleef | unsupported | unsupported |
| erfinv | -(both) | -(both) | unsupported | unsupported |
| exp | + | sleef | -(x86: f()) | -(x86: f()) |
| expm1 | f(exp) | sleef | unsupported | unsupported |
| lgamma | sleef | sleef | | |
| log | + | sleef | -(both) | -(both) |
| log10 | f(log) | sleef | f(log) | f(log) |
| log1p | f(log) | sleef | unsupported | unsupported |
| log2 | f(log) | sleef | f(log) | f(log) |
| pow | + f(exp) | sleef | -(both) | -(both) |
| sin | +(ppc64le: avx_mathfun) | sleef | -(both) | -(both) |
| sinh | f(exp) | sleef | -(both) | -(both) |
| tan | sleef | sleef | -(both) | -(both) |
| tanh | f(exp) | sleef | -(both) | -(both) |
| hypot | sleef | sleef | -(both) | -(both) |
| nextafter | sleef | sleef | -(both) | -(both) |
| fmod | sleef | sleef | -(both) | -(both) |
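As a concrete illustration of the complex-overflow note earlier, here is a scalar sketch of the textbook complex product (an assumption about the formula, matching the x86-style vector code in spirit, not taken from it) whose intermediate products can overflow even when the exact result is small:

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Textbook complex product: (a+bi)(c+di) = (ac - bd) + (ad + bc)i.
// The intermediates ac and bd can each overflow to +inf, so ac - bd
// becomes inf - inf = NaN even though the exact difference may be finite.
std::complex<double> naive_mul(std::complex<double> x, std::complex<double> y) {
  double a = x.real(), b = x.imag(), c = y.real(), d = y.imag();
  return {a * c - b * d, a * d + b * c};
}
```

For (1e200 + 1e200i)² the exact real part is 0, but both intermediate products overflow and the sketch yields NaN, which is why these ops are tested only over a smaller domain range.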

Vec256 Test cases PR #42685
Current list:

  • Blends
  • Memory: UnAlignedLoadStore
  • Arithmetics: Plus, Minus, Multiplication, Division
  • Bitwise: BitAnd, BitOr, BitXor
  • Comparison: Equal, NotEqual, Greater, Less, GreaterEqual, LessEqual
  • MinMax: Minimum, Maximum, ClampMin, ClampMax, Clamp
  • SignManipulation: Absolute, Negate
  • Interleave: Interleave, DeInterleave
  • Rounding: Round, Ceil, Floor, Trunc
  • Mask: ZeroMask
  • SqrtAndReciprocal: Sqrt, RSqrt, Reciprocal
  • Trigonometric: Sin, Cos, Tan
  • Hyperbolic: Tanh, Sinh, Cosh
  • InverseTrigonometric: Asin, ACos, ATan, ATan2
  • Logarithm: Log, Log2, Log10, Log1p
  • Exponents: Exp, Expm1
  • ErrorFunctions: Erf, Erfc, Erfinv
  • Pow: Pow
  • LGamma: LGamma
  • Quantization: quantize, dequantize, requantize_from_int
  • Quantization: widening_subtract, relu, relu6
Missing:

  • Constructors, initializations
  • Conversion, Cast
  • Additional: imag, conj, angle (note: imag and conj are only checked for float complex)
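For readers unfamiliar with the quantized ops listed above, here is a scalar model of the affine scheme they exercise (an illustrative assumption — standard (scale, zero_point) quantization, not the actual Vec256 kernels):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Scalar model of quint8 affine quantization:
//   q  = clamp(round(x / scale) + zero_point, 0, 255)
//   x' = (q - zero_point) * scale
uint8_t quantize(float x, float scale, int32_t zero_point) {
  int32_t q = static_cast<int32_t>(std::nearbyint(x / scale)) + zero_point;
  return static_cast<uint8_t>(std::min(255, std::max(0, q)));
}

float dequantize(uint8_t q, float scale, int32_t zero_point) {
  return (static_cast<int32_t>(q) - zero_point) * scale;
}

// relu in the quantized domain clamps at the zero_point, not at 0.
uint8_t quantized_relu(uint8_t q, int32_t zero_point) {
  return static_cast<uint8_t>(std::max<int32_t>(q, zero_point));
}
```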

Notes on tests and testing framework

  • some math functions are tested within a restricted domain range
  • the testing framework mostly tests randomly against the std implementation, within either the general domain or, for some math functions, the implementation's domain
  • some functions are tested against a local version. For example, std::round and the vector version of round differ, so round was tested against the local version
  • round was tested against PyTorch's at::native::round_impl. For the double type on VSX, vec_round failed for (even)+0.5 values; this was solved by using vec_rint
  • complex types: after enabling complex testing, some of the complex functions failed, due to precision and domain issues, for VSX and x86 AVX alike. I will either test them against a local implementation or check within the accepted domain
  • quantization: added tests for the quantize, dequantize, requantize_from_int, relu, relu6 and widening_subtract functions
  • the testing framework should be improved further
  • for now -DBUILD_MOBILE_TEST=ON will be used for Vec256Test too; Vec256 test cases will be built for each CPU_CAPABILITY
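The vec_round/vec_rint note above comes down to tie-breaking. A quick scalar contrast (illustrative helpers, not the VSX intrinsics themselves), assuming the default FE_TONEAREST rounding mode:

```cpp
#include <cassert>
#include <cmath>

// std::round breaks ties away from zero; std::rint, in the default
// round-to-nearest-even mode, sends (even)+0.5 values to the even
// neighbour. The latter is consistent with the vec_rint fix described
// above; a reference built on std::round fails exactly on those ties.
double round_ties_away(double x) { return std::round(x); }
double round_ties_even(double x) { return std::rint(x); }
```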

@quickwritereader
Contributor Author

#27678
#15676

@dr-ci

dr-ci bot commented Jul 16, 2020

💊 CI failures summary and remediations

As of commit 97c8e17 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 2 failed



@quickwritereader quickwritereader force-pushed the vsx_initial_support_issue27678 branch from 4925d8f to ccab406 Compare July 16, 2020 17:36
@smessmer smessmer added module: vectorization Related to SIMD vectorization, e.g., Vec256 triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Jul 16, 2020
@quickwritereader quickwritereader force-pushed the vsx_initial_support_issue27678 branch 6 times, most recently from 19959fe to 84f7d05 Compare July 17, 2020 02:30
@ezyang
Contributor

ezyang commented Jul 28, 2020

@smessmer in the future, please don't mark an OSS PR as triaged without also assigning a reviewer for it

@ezyang ezyang removed the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 28, 2020
@ezyang ezyang requested a review from VitalyFedyunin July 28, 2020 14:35
@ezyang
Contributor

ezyang commented Jul 28, 2020

Not too sure who should be the shepherd for this. @VitalyFedyunin do you have any ideas?

@mruberry mruberry added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 31, 2020
@quickwritereader
Contributor Author

New modifications on Vec256Tests:

  • Absolute tests are checked against local_abs. For CPU_CAPABILITY_DEFAULT, both complex and real are redirected to std::abs
  • LGamma is checked with an error epsilon so that it can pass the tests on x86
  • for AVX2 and VSX: DeQuantize tests are checked against __builtin_fmaf if the compiler is GNU. For other compilers they are checked with a low-precision error epsilon of 1e-3
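The GNU-only __builtin_fmaf reference matters because a fused multiply-add rounds once while a separate multiply and add round twice, so a DeQuantize kernel built on fma can legitimately differ from a*b + c in the last bits. A minimal sketch (hypothetical values chosen to expose the difference):

```cpp
#include <cassert>
#include <cmath>

// Two ways to compute a*b + c in float: with two roundings, and with one.
float separate_mul_add(float a, float b, float c) {
  volatile float p = a * b;  // volatile blocks contraction into an fma
  return p + c;
}
float fused_mul_add(float a, float b, float c) {
  return std::fmaf(a, b, c);  // single rounding of the exact a*b + c
}
```

With a = 1 + 2^-23, b = 1 - 2^-23, c = -1, the exact product 1 - 2^-46 rounds to 1.0f, so the separate version returns 0 while the fused version keeps the -2^-46 term.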

@quickwritereader
Contributor Author

For now, AVX2 and VSX both fail on these tests:
[ FAILED ] InverseTrigonometric/2.Asin, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<float> >
[ FAILED ] InverseTrigonometric/2.ACos, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<float> >
[ FAILED ] InverseTrigonometric/2.ATan, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<float> >
[ FAILED ] InverseTrigonometric/3.Asin, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<double> >
[ FAILED ] InverseTrigonometric/3.ACos, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<double> >
[ FAILED ] InverseTrigonometric/3.ATan, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<double> >

Additionally, VSX fails on Multiplication because of precision:
[ FAILED ] Arithmetics/2.Multiplication, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<float> >

@quickwritereader
Contributor Author

quickwritereader commented Aug 3, 2020

The failures above are precision related, and the tests can be checked within a restricted domain and with a low-precision epsilon.
But ATan has a sign-related error for both VSX and AVX (the VSX complex code is simply a translation of the AVX complex code):

[ RUN      ] InverseTrigonometric/2.ATan
Total Trial Count:131070
Domain:
{ -10, 10 }
Error epsilon: 1e-05
../../../aten/src/ATen/test/Vec256Test.h:491: Failure
Expected equality of these values:
  nearlyEqual(exp.real(), act.real(), absErr.real())
    Which is: false
  true
-1.5707963705062866 1.5707963705062866
atan: {
vec[(-5.22354,-4), (-4.52348,2.63532), (-5.75149,0.258563), (-0,-1.43798)]
vec_exp:vec[(-1.44969,-0.0913255), (-1.40576,0.0938578), (-1.39898,0.00757274), (-1.5708,-0.858373)]
vec_act:vec[(-1.44969,-0.0913255), (-1.40576,0.0938578), (-1.39898,0.00757276), (1.5708,-0.858373)]
}
[  FAILED  ] InverseTrigonometric/2.ATan, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<float> > (34 ms)
[----------] 1 test from InverseTrigonometric/2 (34 ms total)

[----------] 1 test from InverseTrigonometric/3, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<double> >
[ RUN      ] InverseTrigonometric/3.ATan
Total Trial Count:131070
Domain:
{ -10, 10 }
Error epsilon: 1e-05
../../../aten/src/ATen/test/Vec256Test.h:460: Failure
Expected equality of these values:
  nearlyEqual(exp.real(), act.real(), absErr.real())
    Which is: false
  true
1.5707963267948966 -1.5707963267948966
atan: {
vec[(-5,2.47906), (0,6.06085)]
vec_exp:vec[(-1.41065,0.0777398), (1.5708,0.166516)]
vec_act:vec[(-1.41065,0.0777398), (-1.5708,0.166516)]
}
[  FAILED  ] InverseTrigonometric/3.ATan, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<double> > (1 ms)

@quickwritereader quickwritereader changed the title Vsx initial support issue27678 Vsx initial support issue27678 and Vec256 tests Aug 3, 2020
@quickwritereader
Contributor Author

@VitalyFedyunin if possible, could we discuss my PR this week?
Thanks

@ezyang
Contributor

ezyang commented Aug 4, 2020

Do we know anyone at IBM who could help review this?

@ezyang
Contributor

ezyang commented Aug 4, 2020

@xuhdev / @zasdfgbnm is this the sort of thing either of you would be interested in? :)

@zasdfgbnm
Collaborator

I know very little about CPU vectorization :)

@VitalyFedyunin
Contributor

@VitalyFedyunin if possible, could we discuss my PR this week?
Thanks

hello, I have little PowerPC experience, but given time I can potentially review and land this proposal.

Several notes so far:

  1. It makes sense to separate the tests PR from the VSX support PR (I advise using ghstack)
  2. aten/src/ATen/cpu/vec256/vec256.h becomes messy with the mix of VSX and AVX; perhaps we need to postfix everything properly, as they are mutually exclusive.
  3. Do you have any benchmark numbers to prove that adding VSX instructions performs better than compiler optimization?

@quickwritereader
Contributor Author

quickwritereader commented Aug 4, 2020

@VitalyFedyunin if possible, could we discuss my PR this week?
Thanks

hello, I have little PowerPC experience, but given time I can potentially review and land this proposal.

Several notes so far:

  1. It makes sense to separate the tests PR from the VSX support PR (I advise using ghstack)
  2. aten/src/ATen/cpu/vec256/vec256.h becomes messy with the mix of VSX and AVX; perhaps we need to postfix everything properly, as they are mutually exclusive.
  3. Do you have any benchmark numbers to prove that adding VSX instructions performs better than compiler optimization?
  1. I'll see if I can separate them properly with the suggested tool.

  2. There are very few additions and modifications in vec256.h, as well as in the other vector files.
     In vec256.h only one common header was included (I suffixed it with _vsx) and stream helpers were added for the quantized types.

  3. No, I did not add any benchmark. I was asked by IBM to add initial support for VSX instructions, the same as on x86.
     I only checked Godbolt's output to see whether I was getting decent code. I am emulating 256 bits with two 128-bit vectors. It is still vector intrinsics, so it will mostly outperform scalar code, and at the very least will not suffer from broken auto-vectorization in loops.

  4. Some additional notes on performance and math functions:
     I implemented very few math functions, as this is initial support and I could not decide whether to rely on Sleef or on which implementations to use. Sleef does not have a 256-bit emulation layer for VSX.
     There is also a problem with the non-implemented, std-mapped functions: with the current libm, almost all math functions are not vectorizable. I could make them auto-vectorizable on x86 GCC, assuming it will hold for the Power architecture in the future. For GCC it needs the -ffast-math flag and a small modification of the map function so that auto-vectorization is not broken.

     If this project relied on auto-vectorization and its report tools, there would be no need for any of this. We went that way for my team: we made a reporting tool for the project, we check where vectorization fails, and we help the compiler. But there are places where it is almost impossible to make the code easy for the compiler to auto-vectorize.
     This project, however, relies on a separate Vec256, so why not use explicit vector types or intrinsics? From my experience, compilers still do well with explicit code.
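The "emulate 256 bits with two 128-bit vectors" layout can be sketched portably (plain arrays standing in for the two VSX 128-bit registers — an illustration, not the actual vec256_*_vsx.h code): every lane-wise op is simply applied to both halves independently.

```cpp
#include <cassert>
#include <cstddef>

// A 256-bit float vector modelled as two 128-bit halves of 4 floats each.
struct Vec256f {
  float lo[4];  // first 128-bit half
  float hi[4];  // second 128-bit half

  Vec256f operator+(const Vec256f& o) const {
    Vec256f r{};
    for (std::size_t i = 0; i < 4; ++i) {
      r.lo[i] = lo[i] + o.lo[i];  // would be one vec_add on VSX
      r.hi[i] = hi[i] + o.hi[i];  // and a second vec_add for the other half
    }
    return r;
  }
};
```

Because each half maps to one native register, every 256-bit op costs two 128-bit intrinsics, but the loop structure stays explicit and cannot be broken by failed auto-vectorization.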

Do we know anyone at IBM who could help review this?

@edelsohn or @ceseo could ask some IBM-ers to help us

#include <c10/macros/Macros.h>
#include <ATen/cpu/vec256/intrinsics.h>

using vbool8 = __attribute__((altivec(vector__))) __attribute__((altivec(bool__))) char;
Collaborator


How about replacing __attribute__((altivec(vector__))) by __vector for readability?

Contributor Author


I had to do so because of GCC and default compilation. I opened the issue advance-toolchain/issues/1861.

Collaborator


Ouch, I see. Shame

@Flamefire
Collaborator

Flamefire commented Dec 8, 2020

Having checked that further, my biggest remaining complaint would be the code duplication. The added code is massive, but if I e.g. do a search & replace on the double implementation, replacing all "double" by "float" and all "vfloat64" by "vfloat32", there isn't much of a difference left. It is likely similar for the other types. This way the differences also stick out: e.g. using Sleef in double but some "obscure" code in float; similar for some magic numbers used (1, 3, 7, ...). This makes it hard to verify correctness, and if there is any issue it has to be fixed/improved in all (by now 10) instances.

How about having a Vec256VSX<T, VecType, MaskType> base class and then only subclasses which fill in gaps which are not shared. That shouldn't be much, e.g. the constructor taking 8 floats or 4 doubles is different, but the ones taking 2 vectors or masks are shared. With a bit of sizeof or size() most of the other stuff can be moved to the base class and using some TMP (or just some if with size() checks which will be optimized away) forces to replace magic numbers by meaningful stuff which makes reasoning easier (example: set method). Example:

template <>
class Vec256<float> : public Vec256VSX<float, vfloat32, vbool32> {
public:
  using Parent = Vec256VSX<float, vfloat32, vbool32>;
  using Parent::Parent;
  Vec256(scalar1, scalar2, ...) : Parent(vfloat32(scalar1, ...), vfloat32(scalar5, ...)) {} // if this is even required...
  ...
};

Also the member ops can likely be done in the base class.

Just an idea to make this easier to develop and maintain. I'm not part of FB/PyTorch obviously so this is only from a fellow developer.

@quickwritereader
Contributor Author

quickwritereader commented Dec 8, 2020

@Flamefire
yes, that's a good suggestion. We discussed such cases in #41541 (comment), but I kept the original, as my suggestion to remove sin/cos-like ops from the member ops was declined.

@Flamefire
Collaborator

Ah I see, the base class approach you suggested is also what I had in mind. Using std::plus is nice, but it doesn't use vector intrinsics.
What do you mean by

suggestion for removal sin, cos like ops from member ops declined.

Do I read this correctly as: having the base class would not allow implementing sin/cos as members in the base class template because there is no single function to call?
If so, I don't see this as an exclusion point: you can implement all the common stuff, including add/sub/..., in the base template, and for the ones where overloads are missing (e.g. Sleef has different names for single and double precision) you can either add a new function overload set that wraps the Sleef function or implement those in the subclass:

template <>
class Vec256<float> : public Vec256VSX<float, vfloat32, vbool32> {
public:
  ...
  Vec256<float> cos() { /*whatever*/ }
};

I think this still reduces the amount of code a lot. Currently one such class file is about 300-400 lines long and there are 10 of them, so 3000-4000 LOC. I'd expect the base class to cut that at least in half, maybe more. Just having the mask-vector union and the ctors in a base class already helps, as do the ops.

You might need some CRTP though, if you want the ops to return the Vec256<T> instance.
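The CRTP point can be sketched in a few lines (hypothetical names, not the actual Vec256 hierarchy): the base template implements a shared lane-wise op once, and by taking the derived type as a template parameter it can return the concrete type instead of the base.

```cpp
#include <cassert>

// CRTP base: Derived passes itself in, so shared ops can return Derived.
template <typename Derived>
struct VecBase {
  Derived operator+(const Derived& o) const {
    const Derived& self = static_cast<const Derived&>(*this);
    Derived r;
    for (int i = 0; i < Derived::size(); ++i) {
      r.v[i] = self.v[i] + o.v[i];  // shared lane-wise implementation
    }
    return r;
  }
};

// A concrete "vector" only supplies storage and its lane count; the
// shared ops come from the base and still yield a Vec4f, not a VecBase.
struct Vec4f : VecBase<Vec4f> {
  float v[4]{};
  static constexpr int size() { return 4; }
};
```

The same pattern lets a Vec256VSX-style base host the common ctors and member ops while each Vec256<T> specialization only fills in the type-specific gaps.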

@quickwritereader
Contributor Author

#41541 (comment)
No, I did not mean CRTP static polymorphism for sin-like ops. I just meant to separate out all such ops: we could implement some sin-like ops using those types without fully caring about the architecture, and for PPC it would make things concise.
Well, everything has its pros and cons. This is initial support, so there is always room for improvement.

@VitalyFedyunin
Contributor

Hello @Flamefire, we are looking to rewrite the Vec code (and remove the massive duplication), eventually to support AVX512 and various vector sizes and architectures. So far we are happy with the code as is.

@VitalyFedyunin
Contributor

@quickwritereader please scroll up for the CLA information; without it I cannot import your code to run internal tests.

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Jun 8, 2021
Summary:
This fixes multiple bugs introduced by the VSX optimized code in #41541

- min/max/clamp now consistently return nan when any value is NaN as on other architectures
- The non-complex angle functions return PI for negative values now
- The complex angle functions have been corrected and optimized
- The float32-log function implementation returned a wrong result when inf was passed (and maybe other inputs), replaced by the sleef function just as for float64

Fixes #59248
Fixes #57537

Pull Request resolved: #59382

Reviewed By: jbschlosser

Differential Revision: D28944626

Pulled By: ezyang

fbshipit-source-id: 1ae2782b9e34e458a19cec90617037654279e0e0
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
pytorchmergebot pushed a commit that referenced this pull request Aug 3, 2022
This fixes the remaining bug introduced by the VSX optimized code in #41541

Followup to #59382

### Description

The code currently returns wrong results on POWER9LE making e.g. the `test_binary_ufuncs` fail.

### Testing

Build and ran tests on PPC
Pull Request resolved: #82646
Approved by: https://github.com/ezyang
facebook-github-bot pushed a commit that referenced this pull request Aug 4, 2022

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/39ffad392c49aafcfeba05e2704bb1b666247471

Reviewed By: kit1980

Differential Revision: D38395263

fbshipit-source-id: c4c56af2d8e3b528b6418a4a32c63de77037e5cf
pytorchmergebot pushed a commit to Flamefire/pytorch that referenced this pull request Oct 10, 2022
Replace the remaining hand-written code in vec256_float_vsx.h by calls
to Sleef functions similar to what was done in pytorch#59382 & pytorch#82646 after pytorch#41541
Also remove some whitespace wrongly added in the above PRs.

This fixes wrong results for e.g. `sin(1e20)`.
Fixes pytorch#85978
Flamefire added a commit to Flamefire/pytorch that referenced this pull request Nov 22, 2022
pytorchmergebot pushed a commit that referenced this pull request Nov 22, 2022
Replace the remaining hand-written code in vec256_float_vsx.h by calls to Sleef functions similar to what was done in #59382 & #82646 after #41541

This fixes wrong results for e.g. `sin(1e20)`.
Fixes #85978

To fix #85978 I only needed to do the sin/cos functions to make the test pass but to not encounter the same issue again and again (see the previous PRs and issues) I checked the whole file for similar functions where a Sleef function could be used and changed those too. In the diff I've noticed the faulty whitespace so to make this complete I fixed that too, so it should now be done.

Pull Request resolved: #86453
Approved by: https://github.com/malfet
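The sin(1e20) failure this commit fixes is a classic argument-reduction problem. A deliberately naive reduction (an illustration of the failure mode, not the actual VSX code) shows where the accuracy goes:

```cpp
#include <cassert>
#include <cmath>

// Reducing x modulo 2*pi with pi held only to double precision loses
// roughly x * 1e-16 radians of accuracy -- thousands of radians at
// x = 1e20 -- so the reduced argument, and hence the result, is garbage.
// Sleef (like libm) uses an extended-precision Payne-Hanek reduction
// instead, which is why redirecting to it fixes the huge-argument case.
double naive_sin(double x) {
  const double two_pi = 6.283185307179586;         // 2*pi, double precision
  double r = x - two_pi * std::floor(x / two_pi);  // naive range reduction
  return std::sin(r);                              // fine for small x only
}
```

For small arguments the naive reduction is exact, so the two agree; a correctly reduced sin stays in [-1, 1] for any finite input, including 1e20.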
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022

Labels

cla signed module: vectorization Related to SIMD vectorization, e.g., Vec256 open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
