Support aarch32 neon backend for Vec256 #41267
Conversation
Summary: Due to an LLVM bug and some unsupported intrinsics, we could not directly use intrinsics to implement the aarch32 NEON backend for Vec256. Instead we resort to inline assembly. Test Plan: vec256_test run on an Android phone. Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Summary: Due to an LLVM bug and some unsupported intrinsics, we could not directly use intrinsics to implement the aarch32 NEON backend for Vec256. Instead we resort to inline assembly. Test Plan: vec256_test run on an Android phone. Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 32994b9 Pull Request resolved: #41267
💊 CI failures summary and remediations
As of commit 568d9d6 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
1 failure confirmed as flaky and can be ignored.
🚧 2 fixed upstream failures: these were probably caused by upstream breakages that were already fixed. Please rebase on the
template<>
inline float32x4_t fill_mask<0, true>(float32x4_t mask){
  static uint32x4_t int_mask = {0xFFFFFFFF, 0x0, 0x0, 0x0};
Can this be a constexpr instead of static? If not, it could be const, whether or not you decide to keep it static.
I think I tried that, but I don't think it worked for the intrinsic data type.
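For reference, a minimal sketch of the two variants under discussion; whether constexpr compiles here depends on the toolchain, since uint32x4_t is a compiler-internal vector type (the helper name is illustrative, only the mask line comes from the diff):

```cpp
#include <arm_neon.h>

// const compiles on both GCC and Clang; constexpr may be rejected because
// brace-initializing a NEON vector type need not be a constant expression
// on every toolchain.
inline float32x4_t first_lane_mask() {
  static const uint32x4_t int_mask = {0xFFFFFFFF, 0x0, 0x0, 0x0};
  return vreinterpretq_f32_u32(int_mask);
}
```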
template<>
inline float32x4_t fill_mask<0, true>(float32x4_t mask){
  static uint32x4_t int_mask = {0xFFFFFFFF, 0x0, 0x0, 0x0};
  float32x4_t tmp_mask = vreinterpretq_f32_u32(int_mask);
If you managed to make the above constexpr without running into compilation issues, you might be able to make this constexpr as well, pushing the computation to compile time, since in theory the mask is completely known at compile time.
The same two comments apply to the similar cases below.
Yes, that's what I wanted to do, but it did not work. I am also surprised that I still have vreinterpretq_f32_u32 left here. I need to fix this, or else the compiler seg faults. Although in this case it apparently did not.
| "vorr %[in_mask], %[in_mask], %[in_tmp]\n\t" | ||
| : [in_mask] "+w" (mask) | ||
| : [in_tmp] "w" (tmp_mask) | ||
| : ); |
One thing to keep in mind is to add "cc" to the clobber list if the instruction modifies the flags register. You will have to check the reference manual for every instruction used as inline assembly in this PR to see whether that is the case.
Good point.
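For reference, a sketch of the rule applied to the VORR above: VORR is a NEON data-processing instruction and does not update the APSR condition flags, so "cc" stays out of this clobber list, but a flag-setting instruction would need it (function name illustrative):

```cpp
#include <arm_neon.h>

inline float32x4_t or_mask(float32x4_t mask, const float32x4_t tmp_mask) {
  __asm__ (
      "vorr %[in_mask], %[in_mask], %[in_tmp]\n\t"
      : [in_mask] "+w" (mask)
      : [in_tmp] "w" (tmp_mask)
      : /* add "cc" here only if an instruction modifies the flags */);
  return mask;
}
```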
return 8;
}
Vec256() {}
Vec256(float32x4x2_t v) : values(v) {}
I would personally mark as const all parameters that do not change. :) Making mistakes is easy and I think it's good practice to program defensively.
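A one-line sketch of that style, applied to the constructor in the diff above:

```cpp
// const documents that the constructor never mutates its argument.
Vec256(const float32x4x2_t v) : values(v) {}
```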
{
Vec256<float> vec;
static uint32x4_t mask_low = {0xFFFFFFFF, 0x0, 0x0, 0x0};
vec.values.val[0] = (float32x4_t)mask_low;
It seems to me that what this does is take UINT_MAX (i.e. 0xFFFFFFFF == 4294967295) and store into val[0] the IEEE-754 FP32 bit pattern closest to representing 4294967295.0f (i.e. 0x4F800000). Is that what you intended? If you wanted the 0xFFFFFFFF bit pattern, you should use either vreinterpretq_f32_u32 or *(const float32x4_t*)(&mask_low).
Since the type is uint32x4_t, I am not sure how meaningful it is to cast to a different pointer type. On the other hand, the use of intrinsics was resulting in a seg fault. I just looked the intrinsic up in arm_neon.h and it was doing just (float32x4_t), so I replaced the intrinsic with the cast. Although I'm not sure why the similar use of the intrinsic elsewhere did not cause a problem.
Interesting.
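A sketch contrasting the two operations this thread distinguishes, assuming Clang's arm_neon.h, where the vreinterpret intrinsics lower to plain vector casts:

```cpp
#include <arm_neon.h>

static const uint32x4_t mask_low = {0xFFFFFFFF, 0x0, 0x0, 0x0};

// Bit-level reinterpretation: lane 0 keeps the 0xFFFFFFFF bit pattern
// (a NaN when viewed as a float). This is what the mask code wants.
float32x4_t as_bits = vreinterpretq_f32_u32(mask_low);

// Per-lane value conversion: lane 0 becomes 4294967296.0f, the float
// nearest to UINT_MAX (bit pattern 0x4F800000).
float32x4_t as_values = vcvtq_f32_u32(mask_low);
```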
__asm__ __volatile__ (
"vmov.f32 q3, %[in_low]\n\t"
"vmov.f32 q4, %[in_high]\n\t"
"vst1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"
You can align tmp_values and use the @align property of the store for slightly faster execution.
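A sketch of that suggestion, assuming tmp_values can be 16-byte aligned; the :128 qualifier on the address register is the assembly-level spelling of the alignment hint (the "memory" clobber anticipates the discussion below):

```cpp
alignas(16) float tmp_values[8];
__asm__ __volatile__ (
    "vmov.f32 q3, %[in_low]\n\t"
    "vmov.f32 q4, %[in_high]\n\t"
    // :128 promises a 16-byte-aligned pointer, which can make the store
    // slightly faster on some microarchitectures. An unaligned pointer
    // used with this qualifier faults at runtime.
    "vst1.32 {d6, d7, d8, d9}, [%[in_ptr]:128]\n\t"
    :
    : [in_ptr] "r" (tmp_values),
      [in_low] "w" (values.val[0]), [in_high] "w" (values.val[1])
    : "q3", "q4", "memory");
```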
:
: [in_ptr] "r" (tmp_values),
[in_low] "w" (values.val[0]), [in_high] "w" (values.val[1])
: "q3", "q4");
Again "memory" here and everywhere else you are doing a read or write memory access."
I think it makes the ordering requirements more stringent and prevents the compiler from doing any optimization, since I believe it makes the asm look like a barrier. Hence I did not do it.
I am not sure it is optional. I was under the impression that it is required for correct behavior.
@AshkanAliabadi, this one seems to suggest that it prevents the compiler from reordering loads/stores around the asm: https://stackoverflow.com/questions/14449141/the-difference-between-asm-asm-volatile-and-clobbering-memory. Basically it makes the compiler assume that it doesn't know what memory gets written inside the asm, so it won't reorder loads/stores. But since the address registers are linked into the asm, I assume the compiler understands what gets written? I was trying not to put constraints on compiler optimization.
It seems like modifying this to "rm" should be OK: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
@AshkanAliabadi, it doesn't seem that I can get "rm" to work, so I just added "memory" to the clobber list as you suggested.
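For reference, a sketch of how the resolution reads on the loadu counterpart shown further down, with the registers pinned as in the diff:

```cpp
volatile register float32x4_t low asm("q3");
volatile register float32x4_t high asm("q4");
__asm__ __volatile__ (
    "vld1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"
    :
    : [in_ptr] "r" (ptr)
    // "memory" tells the compiler the asm touches memory it cannot see,
    // so surrounding loads/stores are not reordered across it.
    : "memory");
```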
return Vec256<float>(low, high);
}
else {
__at_align32__ float tmp_values[size()];
OK, it seems that this is how PyTorch prefers to do it, probably a practice from before alignas was added to the language.
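For comparison, a sketch of the post-C++11 spelling, assuming 32-byte alignment is what the __at_align32__ macro provides:

```cpp
// Modern equivalent of the macro on a stack buffer (size() is constexpr):
alignas(32) float tmp_values[size()];
```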
// this should be removed. TODO (kimishpatel)
const float operator[](int idx) const {
__at_align32__ float tmp[size()];
store(tmp);
We have a store() and a loadu(). You could consider adding aligned versions of these functions for internal use in this class, and then use them where you control the alignment of the data, like here. Just a suggestion.
I don't know for sure whether it makes a perf difference. It probably avoids the address-alignment check, but my understanding is that if the address is aligned there is no perf penalty, while if it is not there is a one-cycle penalty. On x86, using the aligned version of a load with an unaligned address raises an exception; I'm not sure whether that's the case here, but either way I don't know whether this introduces a performance penalty.
Vec256<float> operator==(const Vec256<float>& other) const {
float32x4_t res0, res1;
__asm__ (
"vceq.f32 %[res0], %[in_0], %[other_in_0]\n\t"
Does this modify the flags register? I think not, but if it does you need to add "cc" to the clobber list.
I will check. Thanks for this catch.
template<>
inline float32x4_t fill_mask<0, true>(float32x4_t mask){
  static uint32x4_t int_mask = {0xFFFFFFFF, 0x0, 0x0, 0x0};
  float32x4_t tmp_mask = (float32x4_t)(int_mask);
I am still not sure this is the correct portable usage. The fact that the compiler crashes on a use of vreinterpretq_f32_s32 and accepts (float32x4_t) does not necessarily mean that one is logically correct and the other is not. There are places online that use vreinterpret_f32_u32, for instance here: https://skia.googlesource.com/skia/+/refs/heads/chrome/m54/src/opts/SkNx_neon.h#57, and logically speaking your usage looks like a reinterpret_cast, so I am puzzled what the underlying issue is.
It seems that Clang's arm_neon.h implementation of vreinterpretq_f32_s8 does indeed use a C-style cast, but GCC's does not. For instance, take a look at arm_neon.h here, or here, both of which appear to be GCC's implementation using an internal compiler intrinsic; so this approach is at least not portable, even if from a practical standpoint both iOS and recent versions of Android use Clang.
Do we have any tests covering this code path? Have those tests run and passed on Android and iPhone (considering that our CI doesn't run any tests on a phone)?
What do you think @dreiss?
> The fact that the compiler crashes on a use of vreinterpretq_f32_s32 and accepts (float32x4_t) does not necessarily mean that one is logically correct and the other is not.

I agree, but that is not the argument here. The reason for the crash is some interaction between the use of intrinsics and the optimization path taken in 32-bit compilation:
android/ndk#1248
https://bugs.llvm.org/show_bug.cgi?id=45824
In fact, I tried to keep only the inline assembly that was needed and use intrinsics everywhere else. However, this is not about which intrinsics are OK and which are not: depending on the optimization path the compiler takes, it crashes or it does not. So I cannot pick and choose where to use intrinsics and where not to.

> It seems that Clang's arm_neon.h implementation of vreinterpretq_f32_s8 does indeed use a C-style cast, but GCC's does not. [...]

That is a fair point, and we can do internal runs to test this with GCC, to see if that is sufficient.
But without that, we either put in compiler-specific ifdefs and enable this only for Clang, or we don't do the 32-bit ARM NEON backend at all.
Yeah, I think it's still possible to build PyTorch for ARM with gcc (on Raspberry Pi, for example), so we should keep that working. I think a compiler-specific ifdef with a note about why is fine.
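A minimal sketch of such a guard (the exact macro test and placement are illustrative):

```cpp
// Clang's arm_neon.h lowers vreinterpretq_f32_u32 to a plain vector cast,
// and the intrinsic itself can trigger the LLVM crash referenced above
// (llvm.org/PR45824), so use the cast form there; GCC routes the intrinsic
// through an internal builtin, so keep the intrinsic for GCC.
#if defined(__clang__)
float32x4_t tmp_mask = (float32x4_t)int_mask;
#else
float32x4_t tmp_mask = vreinterpretq_f32_u32(int_mask);
#endif
```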
}
static Vec256<float> loadu(const void* ptr, int64_t count = size()) {
if (count == size()) {
volatile register float32x4_t low asm("q3");
Well, what this means is that we are only postponing this issue, since whoever ports PyTorch to C++17 in a year or two will have to deal with it. :) Anyhow, I am OK with that if you and David are.
volatile register float32x4_t low asm("q3");
volatile register float32x4_t high asm("q4");
__asm__ __volatile__ (
"vld1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"
This is microarchitecture dependent, but generally there is a performance benefit to aligned loads and stores. From the reference manual:

> You can also specify an alignment for the pointer passed in Rn, using the optional : parameter, which often speeds up memory accesses.

You can see in the table on VLDn's reference page that your particular use (i.e. VLDn {Dd, D(d+1), D(d+2), D(d+3)}, Rn) accepts three different, increasingly strict alignments.
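A sketch of the qualified load; per VLD1's reference table, the four-D-register form accepts 64-, 128-, or 256-bit alignment hints, the strictest requiring a 32-byte-aligned pointer:

```cpp
__asm__ __volatile__ (
    // :256 asserts 32-byte alignment of the source pointer; use :64 or
    // :128 for weaker guarantees. An unaligned pointer here would fault.
    "vld1.32 {d6, d7, d8, d9}, [%[in_ptr]:256]\n\t"
    :
    : [in_ptr] "r" (ptr)
    : "memory");
```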
LGTM except for the two issues. It would be great to see what David thinks about them.
@AshkanAliabadi, hopefully by then the compiler issue will be resolved and we can remove the inline-assembly path completely. If you have a better suggestion that avoids the moves in the inline assembly, I would be happy to do it that way. I could not think of any.
Summary: Due to an LLVM bug and some unsupported intrinsics, we could not directly use intrinsics to implement the aarch32 NEON backend for Vec256. Instead we resort to inline assembly. Test Plan: vec256_test run on an Android phone. Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D22482196](https://our.internmc.facebook.com/intern/diff/D22482196) Benchmark results:

| Benchmark | Time (ns) Baseline | Time (ns) 32-bit ARM NEON backend Vec256 | Speedup |
| -- | -- | -- | -- |
| tensor_add/N:8/C:8 | 2655 | 2567 | 1.03428126 |
| tensor_add/N:8/C:16 | 2805 | 2630 | 1.06653992 |
| tensor_add/N:8/C:32 | 3303 | 2814 | 1.17377399 |
| tensor_add/N:8/C:64 | 4152 | 3156 | 1.31558935 |
| tensor_add/N:8/C:128 | 5761 | 3767 | 1.52933369 |
| tensor_add/N:8/C:256 | 9234 | 5610 | 1.6459893 |
| tensor_add/N:8/C:512 | 15685 | 8952 | 1.75212243 |
| tensor_add/N:16/C:8 | 2905 | 2630 | 1.10456274 |
| tensor_add/N:16/C:16 | 3428 | 2713 | 1.26354589 |
| tensor_add/N:16/C:32 | 3958 | 3098 | 1.27759845 |
| tensor_add/N:16/C:64 | 5548 | 3864 | 1.43581781 |
| tensor_add/N:16/C:128 | 9343 | 5540 | 1.68646209 |
| tensor_add/N:16/C:256 | 16239 | 9274 | 1.75102437 |
| tensor_add/N:16/C:512 | 29663 | 14772 | 2.00805578 |
| tensor_add/N:32/C:8 | 3218 | 2721 | 1.18265344 |
| tensor_add/N:32/C:16 | 4006 | 3178 | 1.26054122 |
| tensor_add/N:32/C:32 | 5576 | 3921 | 1.4220862 |
| tensor_add/N:32/C:64 | 9263 | 5614 | 1.64998219 |
| tensor_add/N:32/C:128 | 16464 | 8593 | 1.91597812 |
| tensor_add/N:32/C:256 | 29613 | 14656 | 2.02053766 |
| tensor_add/N:32/C:512 | 82981 | 56478 | 1.46926237 |
| tensor_add/N:64/C:8 | 4078 | 3188 | 1.27917189 |
| tensor_add/N:64/C:16 | 5780 | 3935 | 1.46886912 |
| tensor_add/N:64/C:32 | 9367 | 5382 | 1.74043107 |
| tensor_add/N:64/C:64 | 16069 | 8658 | 1.85597136 |
| tensor_add/N:64/C:128 | 28780 | 15342 | 1.87589623 |
| tensor_add/N:64/C:256 | 80601 | 57625 | 1.39871584 |
| tensor_add/N:64/C:512 | 172897 | 100800 | 1.71524802 |
| tensor_add/N:128/C:8 | 5771 | 3800 | 1.51868421 |
| tensor_add/N:128/C:16 | 9408 | 5496 | 1.71179039 |
| tensor_add/N:128/C:32 | 15836 | 8735 | 1.81293646 |
| tensor_add/N:128/C:64 | 29315 | 15229 | 1.92494583 |
| tensor_add/N:128/C:128 | 88595 | 51456 | 1.72176228 |
| tensor_add/N:128/C:256 | 160211 | 101824 | 1.57341098 |
| tensor_add/N:128/C:512 | 218426 | 165128 | 1.32276779 |
| tensor_add/N:256/C:8 | 9194 | 5538 | 1.66016612 |
| tensor_add/N:256/C:16 | 15659 | 9036 | 1.73295706 |
| tensor_add/N:256/C:32 | 29465 | 15246 | 1.93263807 |
| tensor_add/N:256/C:64 | 85880 | 50658 | 1.69528998 |
| tensor_add/N:256/C:128 | 160866 | 102213 | 1.57383112 |
| tensor_add/N:256/C:256 | 236656 | 161019 | 1.4697396 |
| tensor_add/N:256/C:512 | 268035 | 236729 | 1.13224404 |
| tensor_add/N:512/C:8 | 16304 | 9000 | 1.81155556 |
| tensor_add/N:512/C:16 | 29797 | 15173 | 1.96381731 |
| tensor_add/N:512/C:32 | 82945 | 53583 | 1.5479723 |
| tensor_add/N:512/C:64 | 163287 | 107626 | 1.51717057 |
| tensor_add/N:512/C:128 | 207900 | 145849 | 1.42544687 |
| tensor_add/N:512/C:256 | 284782 | 233692 | 1.21862109 |
| tensor_add/N:512/C:512 | 798740 | 529302 | 1.50904399 |

[ghstack-poisoned]
AshkanAliabadi left a comment
Thank you Kimish.
This pull request has been merged in dede71d.
volatile register float32x4_t low asm("q3");
volatile register float32x4_t high asm("q4");
__asm__ __volatile__ (
"vld1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"
Could you please try to use vld1.32 {%S0 - %V0} instead of naming registers d6 to d9?
Unfortunately this will work only with GCC, and not with Clang.
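A sketch of the suggested form, per the comment above (GCC-only; the %S0 through %V0 operand modifiers print the consecutive D registers backing operand 0, so the compiler chooses the registers instead of the hard-coded d6-d9 / q3-q4 pinning; operand types illustrative):

```cpp
float32x4x2_t v;
__asm__ __volatile__ (
    "vld1.32 {%S0 - %V0}, [%1]\n\t"  // Clang rejects these modifiers
    : "=w" (v)
    : "r" (ptr)
    : "memory");
```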
Stack from ghstack:
Summary:
Due to an LLVM bug and some unsupported intrinsics, we could not directly
use intrinsics to implement the aarch32 NEON backend for Vec256.
Instead we resort to inline assembly.
Test Plan:
vec256_test run on an Android phone.
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D22482196
Benchmark results: