

@quickwritereader quickwritereader commented Jul 16, 2020

Pytorch Vec256 ppc64le support

implemented types:

  • double
  • float
  • int16
  • int32
  • int64
  • qint32
  • qint8
  • quint8
  • complex_float
  • complex_double

Notes:
All basic vector operations are implemented. There are a few problems:

  • NaN propagation for minimum/maximum on ppc64le is missing and was not checked
  • complex multiplication, division, sqrt and abs are implemented the same way as in PyTorch's x86 code; they can overflow and have worse precision than the std:: ones. That is why they were either excluded from testing or tested over a smaller domain range
  • the precision of the implemented float math functions is limited
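The missing NaN-propagation semantics can be shown with a scalar sketch (a hypothetical helper for illustration, not the Vec256 code itself): PyTorch expects minimum/maximum to return NaN whenever either input is NaN, which plain std::min/std::max do not guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Scalar sketch of a NaN-propagating minimum (hypothetical helper, not the
// actual Vec256 member): if either operand is NaN the result must be NaN.
// std::min(a, NaN) would instead return a, because the NaN comparison is
// false -- exactly the behaviour difference left unchecked on ppc64le.
double propagating_min(double a, double b) {
  if (std::isnan(a) || std::isnan(b)) {
    return std::numeric_limits<double>::quiet_NaN();
  }
  return std::min(a, b);
}
```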

In addition, I added a CPU_CAPABILITY for Power, but because of quantization errors with DEFAULT I had to undef it and use VSX for DEFAULT as well.

Details

Supported math functions
  • + (plus) means vectorized; - (minus) means missing; implementation notes are added in parentheses. Example: -(both) means it is also missing on the x86 side
  • f(func_name) means the vectorization is implemented via func_name
  • sleef means redirected to the Sleef library
  • unsupported means not supported for that type
| function_name | float | double | complex float | complex double |
| --- | --- | --- | --- | --- |
| acos | sleef | sleef | f(asin) | f(asin) |
| asin | sleef | sleef | +(pytorch impl) | +(pytorch impl) |
| atan | sleef | sleef | f(log) | f(log) |
| atan2 | sleef | sleef | unsupported | unsupported |
| cos | +(ppc64le: avx_mathfun) | sleef | -(both) | -(both) |
| cosh | f(exp) | | -(both) | -(both) |
| erf | sleef | sleef | unsupported | unsupported |
| erfc | sleef | sleef | unsupported | unsupported |
| erfinv | -(both) | -(both) | unsupported | unsupported |
| exp | + | sleef | -(x86: f()) | -(x86: f()) |
| expm1 | f(exp) | sleef | unsupported | unsupported |
| lgamma | sleef | sleef | | |
| log | + | sleef | -(both) | -(both) |
| log10 | f(log) | sleef | f(log) | f(log) |
| log1p | f(log) | sleef | unsupported | unsupported |
| log2 | f(log) | sleef | f(log) | f(log) |
| pow | + f(exp) | sleef | -(both) | -(both) |
| sin | +(ppc64le: avx_mathfun) | sleef | -(both) | -(both) |
| sinh | f(exp) | sleef | -(both) | -(both) |
| tan | sleef | sleef | -(both) | -(both) |
| tanh | f(exp) | sleef | -(both) | -(both) |
| hypot | sleef | sleef | -(both) | -(both) |
| nextafter | sleef | sleef | -(both) | -(both) |
| fmod | sleef | sleef | -(both) | -(both) |
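As a concrete illustration of the complex-overflow note earlier, here is a scalar sketch of the textbook complex product (an assumption about the formula, matching the x86-style vector code in spirit, not taken from it) whose intermediate products can overflow even when the exact result is small:

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Textbook complex product: (a+bi)(c+di) = (ac - bd) + (ad + bc)i.
// The intermediates ac and bd can each overflow to +inf, so ac - bd
// becomes inf - inf = NaN even though the exact difference may be finite.
std::complex<double> naive_mul(std::complex<double> x, std::complex<double> y) {
  double a = x.real(), b = x.imag(), c = y.real(), d = y.imag();
  return {a * c - b * d, a * d + b * c};
}
```

For (1e200 + 1e200i)² the exact real part is 0, but both intermediate products overflow and the sketch yields NaN, which is why these ops are tested only over a smaller domain range.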

Vec256 Test cases PR #42685
Current list:

  • Blends
  • Memory: UnAlignedLoadStore
  • Arithmetics: Plus, Minus, Multiplication, Division
  • Bitwise: BitAnd, BitOr, BitXor
  • Comparison: Equal, NotEqual, Greater, Less, GreaterEqual, LessEqual
  • MinMax: Minimum, Maximum, ClampMin, ClampMax, Clamp
  • SignManipulation: Absolute, Negate
  • Interleave: Interleave, DeInterleave
  • Rounding: Round, Ceil, Floor, Trunc
  • Mask: ZeroMask
  • SqrtAndReciprocal: Sqrt, RSqrt, Reciprocal
  • Trigonometric: Sin, Cos, Tan
  • Hyperbolic: Tanh, Sinh, Cosh
  • InverseTrigonometric: Asin, ACos, ATan, ATan2
  • Logarithm: Log, Log2, Log10, Log1p
  • Exponents: Exp, Expm1
  • ErrorFunctions: Erf, Erfc, Erfinv
  • Pow: Pow
  • LGamma: LGamma
  • Quantization: quantize, dequantize, requantize_from_int
  • Quantization: widening_subtract, relu, relu6
Missing:

  • Constructors, initializations
  • Conversion, Cast
  • Additional: imag, conj, angle (note: imag and conj are only checked for float complex)
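For readers unfamiliar with the quantized ops listed above, here is a scalar model of the affine scheme they exercise (an illustrative assumption — standard (scale, zero_point) quantization, not the actual Vec256 kernels):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Scalar model of quint8 affine quantization:
//   q  = clamp(round(x / scale) + zero_point, 0, 255)
//   x' = (q - zero_point) * scale
uint8_t quantize(float x, float scale, int32_t zero_point) {
  int32_t q = static_cast<int32_t>(std::nearbyint(x / scale)) + zero_point;
  return static_cast<uint8_t>(std::min(255, std::max(0, q)));
}

float dequantize(uint8_t q, float scale, int32_t zero_point) {
  return (static_cast<int32_t>(q) - zero_point) * scale;
}

// relu in the quantized domain clamps at the zero_point, not at 0.
uint8_t quantized_relu(uint8_t q, int32_t zero_point) {
  return static_cast<uint8_t>(std::max<int32_t>(q, zero_point));
}
```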

Notes on tests and testing framework

  • some math functions are tested within a restricted domain range
  • the testing framework mostly tests randomly against the std implementation, within either the general domain or, for some math functions, the implementation's domain
  • some functions are tested against a local version. For example, std::round and the vector version of round differ, so round was tested against the local version
  • round was tested against PyTorch's at::native::round_impl. For the double type on VSX, vec_round failed for (even)+0.5 values; this was solved by using vec_rint
  • complex types: after enabling complex testing, some of the complex functions failed, due to precision and domain issues, for VSX and x86 AVX alike. I will either test them against a local implementation or check within the accepted domain
  • quantization: added tests for the quantize, dequantize, requantize_from_int, relu, relu6 and widening_subtract functions
  • the testing framework should be improved further
  • for now -DBUILD_MOBILE_TEST=ON will be used for Vec256Test too; Vec256 test cases will be built for each CPU_CAPABILITY
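The vec_round/vec_rint note above comes down to tie-breaking. A quick scalar contrast (illustrative helpers, not the VSX intrinsics themselves), assuming the default FE_TONEAREST rounding mode:

```cpp
#include <cassert>
#include <cmath>

// std::round breaks ties away from zero; std::rint, in the default
// round-to-nearest-even mode, sends (even)+0.5 values to the even
// neighbour. The latter is consistent with the vec_rint fix described
// above; a reference built on std::round fails exactly on those ties.
double round_ties_away(double x) { return std::round(x); }
double round_ties_even(double x) { return std::rint(x); }
```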

@quickwritereader
Contributor Author

#27678
#15676

@dr-ci

dr-ci bot commented Jul 16, 2020

💊 CI failures summary and remediations

As of commit 97c8e17 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 2 failed



@quickwritereader quickwritereader force-pushed the vsx_initial_support_issue27678 branch from 4925d8f to ccab406 Compare July 16, 2020 17:36
@smessmer smessmer added module: vectorization Related to SIMD vectorization, e.g., Vec256 triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Jul 16, 2020
@quickwritereader quickwritereader force-pushed the vsx_initial_support_issue27678 branch 6 times, most recently from 19959fe to 84f7d05 Compare July 17, 2020 02:30
@ezyang
Contributor

ezyang commented Jul 28, 2020

@smessmer in the future, please don't mark an OSS PR as triaged without also assigning a reviewer for it

@ezyang ezyang removed the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 28, 2020
@ezyang ezyang requested a review from VitalyFedyunin July 28, 2020 14:35
@ezyang
Contributor

ezyang commented Jul 28, 2020

Not too sure who should be the shepherd for this. @VitalyFedyunin do you have any ideas?

@mruberry mruberry added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 31, 2020
@quickwritereader
Contributor Author

New modifications on Vec256Tests:

  • Absolute tests are checked against local_abs. For CPU_CAPABILITY_DEFAULT, both complex and real are redirected to std::abs
  • LGamma is checked with an error epsilon so that it can pass the tests on x86
  • for AVX2 and VSX: DeQuantize tests are checked against __builtin_fmaf if the compiler is GNU. For other compilers they are checked with a low-precision error epsilon of 1e-3
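The GNU-only __builtin_fmaf reference matters because a fused multiply-add rounds once while a separate multiply and add round twice, so a DeQuantize kernel built on fma can legitimately differ from a*b + c in the last bits. A minimal sketch (hypothetical values chosen to expose the difference):

```cpp
#include <cassert>
#include <cmath>

// Two ways to compute a*b + c in float: with two roundings, and with one.
float separate_mul_add(float a, float b, float c) {
  volatile float p = a * b;  // volatile blocks contraction into an fma
  return p + c;
}
float fused_mul_add(float a, float b, float c) {
  return std::fmaf(a, b, c);  // single rounding of the exact a*b + c
}
```

With a = 1 + 2^-23, b = 1 - 2^-23, c = -1, the exact product 1 - 2^-46 rounds to 1.0f, so the separate version returns 0 while the fused version keeps the -2^-46 term.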

@quickwritereader
Contributor Author

For now, AVX2 and VSX both fail on these tests:
[ FAILED ] InverseTrigonometric/2.Asin, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<float> >
[ FAILED ] InverseTrigonometric/2.ACos, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<float> >
[ FAILED ] InverseTrigonometric/2.ATan, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<float> >
[ FAILED ] InverseTrigonometric/3.Asin, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<double> >
[ FAILED ] InverseTrigonometric/3.ACos, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<double> >
[ FAILED ] InverseTrigonometric/3.ATan, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<double> >

Additionally, VSX fails on Multiplication because of precision:
[ FAILED ] Arithmetics/2.Multiplication, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<float> >

@quickwritereader
Contributor Author

quickwritereader commented Aug 3, 2020

The failures above are precision related, and the tests can be checked within a restricted domain and with a low-precision epsilon.
But ATan has a sign-related error for both VSX and AVX (the VSX complex code is simply a translation of the AVX complex code):

[ RUN      ] InverseTrigonometric/2.ATan
Total Trial Count:131070
Domain:
{ -10, 10 }
Error epsilon: 1e-05
../../../aten/src/ATen/test/Vec256Test.h:491: Failure
Expected equality of these values:
  nearlyEqual(exp.real(), act.real(), absErr.real())
    Which is: false
  true
-1.5707963705062866 1.5707963705062866
atan: {
vec[(-5.22354,-4), (-4.52348,2.63532), (-5.75149,0.258563), (-0,-1.43798)]
vec_exp:vec[(-1.44969,-0.0913255), (-1.40576,0.0938578), (-1.39898,0.00757274), (-1.5708,-0.858373)]
vec_act:vec[(-1.44969,-0.0913255), (-1.40576,0.0938578), (-1.39898,0.00757276), (1.5708,-0.858373)]
}
[  FAILED  ] InverseTrigonometric/2.ATan, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<float> > (34 ms)
[----------] 1 test from InverseTrigonometric/2 (34 ms total)

[----------] 1 test from InverseTrigonometric/3, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<double> >
[ RUN      ] InverseTrigonometric/3.ATan
Total Trial Count:131070
Domain:
{ -10, 10 }
Error epsilon: 1e-05
../../../aten/src/ATen/test/Vec256Test.h:460: Failure
Expected equality of these values:
  nearlyEqual(exp.real(), act.real(), absErr.real())
    Which is: false
  true
1.5707963267948966 -1.5707963267948966
atan: {
vec[(-5,2.47906), (0,6.06085)]
vec_exp:vec[(-1.41065,0.0777398), (1.5708,0.166516)]
vec_act:vec[(-1.41065,0.0777398), (-1.5708,0.166516)]
}
[  FAILED  ] InverseTrigonometric/3.ATan, where TypeParam = at::vec256::(anonymous namespace)::Vec256<c10::complex<double> > (1 ms)

@quickwritereader quickwritereader changed the title Vsx initial support issue27678 Vsx initial support issue27678 and Vec256 tests Aug 3, 2020
@quickwritereader
Contributor Author

@VitalyFedyunin if possible, could we discuss my PR this week?
Thanks

@ezyang
Contributor

ezyang commented Aug 4, 2020

Do we know anyone at IBM who could help review this?

@ezyang
Contributor

ezyang commented Aug 4, 2020

@xuhdev / @zasdfgbnm is this the sort of thing either of you would be interested in? :)

@zasdfgbnm
Collaborator

I know very little about CPU vectorization :)

@VitalyFedyunin
Contributor

@VitalyFedyunin if possible, could we discuss my PR this week?
Thanks

hello, I have little PowerPC experience, but given time I can potentially review and land this proposal.

Several notes so far:

  1. It makes sense to separate the tests PR from the VSX support PR (I advise using ghstack)
  2. aten/src/ATen/cpu/vec256/vec256.h becomes messy with the mix of VSX and AVX; perhaps we need to postfix everything properly, as they are mutually exclusive.
  3. Do you have any benchmark numbers to prove that adding VSX instructions performs better than compiler optimization?

@quickwritereader
Contributor Author

quickwritereader commented Aug 4, 2020

@VitalyFedyunin if possible, could we discuss my PR this week?
Thanks

hello, I have little PowerPC experience, but given time I can potentially review and land this proposal.

Several notes so far:

  1. It makes sense to separate the tests PR from the VSX support PR (I advise using ghstack)
  2. aten/src/ATen/cpu/vec256/vec256.h becomes messy with the mix of VSX and AVX; perhaps we need to postfix everything properly, as they are mutually exclusive.
  3. Do you have any benchmark numbers to prove that adding VSX instructions performs better than compiler optimization?
  1. I'll see if I can separate them properly with the suggested tool.

  2. There are very few additions and modifications in vec256.h, as well as in the other vector files.
     In vec256.h only one common header was included (I suffixed it with _vsx) and stream helpers were added for the quantized types.

  3. No, I did not add any benchmark. I was asked by IBM to add initial support for VSX instructions, the same as on x86.
     I only checked Godbolt's output to see whether I was getting decent code. I am emulating 256 bits with two 128-bit vectors. It is still vector intrinsics, so it will mostly outperform scalar code, and at the very least will not suffer from broken auto-vectorization in loops.

  4. Some additional notes on performance and math functions:
     I implemented very few math functions, as this is initial support and I could not decide whether to rely on Sleef or on which implementations to use. Sleef does not have a 256-bit emulation layer for VSX.
     There is also a problem with the non-implemented, std-mapped functions: with the current libm, almost all math functions are not vectorizable. I could make them auto-vectorizable on x86 GCC, assuming it will hold for the Power architecture in the future. For GCC it needs the -ffast-math flag and a small modification of the map function so that auto-vectorization is not broken.

     If this project relied on auto-vectorization and its report tools, there would be no need for any of this. We went that way for my team: we made a reporting tool for the project, we check where vectorization fails, and we help the compiler. But there are places where it is almost impossible to make the code easy for the compiler to auto-vectorize.
     This project, however, relies on a separate Vec256, so why not use explicit vector types or intrinsics? From my experience, compilers still do well with explicit code.
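The "emulate 256 bits with two 128-bit vectors" layout can be sketched portably (plain arrays standing in for the two VSX 128-bit registers — an illustration, not the actual vec256_*_vsx.h code): every lane-wise op is simply applied to both halves independently.

```cpp
#include <cassert>
#include <cstddef>

// A 256-bit float vector modelled as two 128-bit halves of 4 floats each.
struct Vec256f {
  float lo[4];  // first 128-bit half
  float hi[4];  // second 128-bit half

  Vec256f operator+(const Vec256f& o) const {
    Vec256f r{};
    for (std::size_t i = 0; i < 4; ++i) {
      r.lo[i] = lo[i] + o.lo[i];  // would be one vec_add on VSX
      r.hi[i] = hi[i] + o.hi[i];  // and a second vec_add for the other half
    }
    return r;
  }
};
```

Because each half maps to one native register, every 256-bit op costs two 128-bit intrinsics, but the loop structure stays explicit and cannot be broken by failed auto-vectorization.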

Do we know anyone at IBM who could help review this?

@edelsohn or @ceseo could ask some IBM-ers to help us

#include <c10/macros/Macros.h>
#include <ATen/cpu/vec256/intrinsics.h>

using vbool8 = __attribute__((altivec(vector__))) __attribute__((altivec(bool__))) char;
Collaborator


How about replacing __attribute__((altivec(vector__))) by __vector for readability?

Contributor Author


I had to do so because of GCC and default compilation. I opened the issue advance-toolchain/issues/1861.

Collaborator


Ouch, I see. Shame

@Flamefire
Collaborator

Flamefire commented Dec 8, 2020

Having checked that further, my biggest remaining complaint would be the code duplication. The added code is massive, but if I e.g. do a search & replace on the double implementation, replacing all "double" by "float" and all "vfloat64" by "vfloat32", there isn't much of a difference left. It is likely similar for the other types. This way the differences also stick out: e.g. using Sleef in double but some "obscure" code in float; similar for some magic numbers used (1, 3, 7, ...). This makes it hard to verify correctness, and if there is any issue it has to be fixed/improved in all (by now 10) instances.

How about having a Vec256VSX<T, VecType, MaskType> base class and then only subclasses which fill in gaps which are not shared. That shouldn't be much, e.g. the constructor taking 8 floats or 4 doubles is different, but the ones taking 2 vectors or masks are shared. With a bit of sizeof or size() most of the other stuff can be moved to the base class and using some TMP (or just some if with size() checks which will be optimized away) forces to replace magic numbers by meaningful stuff which makes reasoning easier (example: set method). Example:

template <>
class Vec256<float> : public Vec256VSX<float, vfloat32, vbool32> {
public:
  using Parent = Vec256VSX<float, vfloat32, vbool32>;
  using Parent::Parent;
  Vec256(scalar1, scalar2, ...) : Parent(vfloat32(scalar1, ...), vfloat32(scalar5, ...)) {} // if this is even required...
  ...
};

Also the member ops can likely be done in the base class.

Just an idea to make this easier to develop and maintain. I'm not part of FB/PyTorch obviously so this is only from a fellow developer.

@quickwritereader
Contributor Author

quickwritereader commented Dec 8, 2020

@Flamefire
yes, that's a good suggestion. We discussed such cases in #41541 (comment), but I kept the original, as my suggestion to remove sin/cos-like ops from the member ops was declined.

@Flamefire
Collaborator

Ah I see, the base class approach you suggested is also what I had in mind. Using std::plus is nice, but it doesn't use vector intrinsics.
What do you mean by

suggestion for removal sin, cos like ops from member ops declined.

Do I read this correctly as: having the base class would not allow implementing sin/cos as members in the base class template because there is no single function to call?
If so, I don't see this as an exclusion point: you can implement all the common stuff, including add/sub/..., in the base template, and for the ones where overloads are missing (e.g. Sleef has different names for single and double precision) you can either add a new function overload set that wraps the Sleef function or implement those in the subclass:

template <>
class Vec256<float> : public Vec256VSX<float, vfloat32, vbool32> {
public:
  ...
  Vec256<float> cos() { /*whatever*/ }
};

I think this still reduces the amount of code a lot. Currently one such class file is about 300-400 lines long and there are 10 of them, so 3000-4000 LOC. I'd expect the base class to cut that at least in half, maybe more. Just having the mask-vector union and the ctors in a base class already helps, as do the ops.

You might need some CRTP though, if you want the ops to return the Vec256<T> instance.
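The CRTP point can be sketched in a few lines (hypothetical names, not the actual Vec256 hierarchy): the base template implements a shared lane-wise op once, and by taking the derived type as a template parameter it can return the concrete type instead of the base.

```cpp
#include <cassert>

// CRTP base: Derived passes itself in, so shared ops can return Derived.
template <typename Derived>
struct VecBase {
  Derived operator+(const Derived& o) const {
    const Derived& self = static_cast<const Derived&>(*this);
    Derived r;
    for (int i = 0; i < Derived::size(); ++i) {
      r.v[i] = self.v[i] + o.v[i];  // shared lane-wise implementation
    }
    return r;
  }
};

// A concrete "vector" only supplies storage and its lane count; the
// shared ops come from the base and still yield a Vec4f, not a VecBase.
struct Vec4f : VecBase<Vec4f> {
  float v[4]{};
  static constexpr int size() { return 4; }
};
```

The same pattern lets a Vec256VSX-style base host the common ctors and member ops while each Vec256<T> specialization only fills in the type-specific gaps.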

@quickwritereader
Contributor Author

#41541 (comment)
No, I did not mean CRTP static polymorphism for sin-like ops. I just meant to separate out all such ops: we could implement some sin-like ops using those types without fully caring about the architecture, and for PPC it would make things concise.
Well, everything has its pros and cons. This is initial support, so there is always room for improvement.

@VitalyFedyunin
Contributor

Hello @Flamefire, we are looking to rewrite the Vec code (and remove the massive duplication), eventually to support AVX512 and various vector sizes and architectures. So far we are happy with the code as is.

@VitalyFedyunin
Contributor

@quickwritereader please scroll up for the CLA information; without it I cannot import your code to run internal tests.

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Jun 8, 2021
Summary:
This fixes multiple bugs introduced by the VSX optimized code in #41541

- min/max/clamp now consistently return nan when any value is NaN as on other architectures
- The non-complex angle functions return PI for negative values now
- The complex angle functions have been corrected and optimized
- The float32-log function implementation returned a wrong result when inf was passed (and maybe other inputs), replaced by the sleef function just as for float64

Fixes #59248
Fixes #57537

Pull Request resolved: #59382

Reviewed By: jbschlosser

Differential Revision: D28944626

Pulled By: ezyang

fbshipit-source-id: 1ae2782b9e34e458a19cec90617037654279e0e0
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
pytorchmergebot pushed a commit that referenced this pull request Aug 3, 2022
This fixes the remaining bug introduced by the VSX optimized code in #41541

Followup to #59382

### Description

The code currently returns wrong results on POWER9LE making e.g. the `test_binary_ufuncs` fail.

### Testing

Build and ran tests on PPC
Pull Request resolved: #82646
Approved by: https://github.com/ezyang
facebook-github-bot pushed a commit that referenced this pull request Aug 4, 2022

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/39ffad392c49aafcfeba05e2704bb1b666247471

Reviewed By: kit1980

Differential Revision: D38395263

fbshipit-source-id: c4c56af2d8e3b528b6418a4a32c63de77037e5cf
pytorchmergebot pushed a commit to Flamefire/pytorch that referenced this pull request Oct 10, 2022
Replace the remaining hand-written code in vec256_float_vsx.h by calls
to Sleef functions similar to what was done in pytorch#59382 & pytorch#82646 after pytorch#41541
Also remove some whitespace wrongly added in the above PRs.

This fixes wrong results for e.g. `sin(1e20)`.
Fixes pytorch#85978
Flamefire added a commit to Flamefire/pytorch that referenced this pull request Nov 22, 2022
pytorchmergebot pushed a commit that referenced this pull request Nov 22, 2022
Replace the remaining hand-written code in vec256_float_vsx.h by calls to Sleef functions similar to what was done in #59382 & #82646 after #41541

This fixes wrong results for e.g. `sin(1e20)`.
Fixes #85978

To fix #85978 I only needed to do the sin/cos functions to make the test pass but to not encounter the same issue again and again (see the previous PRs and issues) I checked the whole file for similar functions where a Sleef function could be used and changed those too. In the diff I've noticed the faulty whitespace so to make this complete I fixed that too, so it should now be done.

Pull Request resolved: #86453
Approved by: https://github.com/malfet
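The sin(1e20) failure this commit fixes is a classic argument-reduction problem. A deliberately naive reduction (an illustration of the failure mode, not the actual VSX code) shows where the accuracy goes:

```cpp
#include <cassert>
#include <cmath>

// Reducing x modulo 2*pi with pi held only to double precision loses
// roughly x * 1e-16 radians of accuracy -- thousands of radians at
// x = 1e20 -- so the reduced argument, and hence the result, is garbage.
// Sleef (like libm) uses an extended-precision Payne-Hanek reduction
// instead, which is why redirecting to it fixes the huge-argument case.
double naive_sin(double x) {
  const double two_pi = 6.283185307179586;         // 2*pi, double precision
  double r = x - two_pi * std::floor(x / two_pi);  // naive range reduction
  return std::sin(r);                              // fine for small x only
}
```

For small arguments the naive reduction is exact, so the two agree; a correctly reduced sin stays in [-1, 1] for any finite input, including 1e20.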
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022

Labels

cla signed module: vectorization Related to SIMD vectorization, e.g., Vec256 open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
