
Conversation

@kimishpatel (Contributor) commented Jul 10, 2020

Stack from ghstack:

Summary:
Due to an LLVM bug and some unsupported intrinsics, we could not directly
use intrinsics to implement the aarch32 NEON backend for Vec256.
Instead we resort to inline assembly.

Test Plan:
vec256_test run on an Android phone.


Differential Revision: D22482196

Benchmark results:

| Benchmark | Time (ns) Baseline | Time (ns) 32 bit arm neon backend vec256 | Speedup |
| -- | -- | -- | -- |
| tensor_add/N:8/C:8 | 2655 | 2567 | 1.03428126 |
| tensor_add/N:8/C:16 | 2805 | 2630 | 1.06653992 |
| tensor_add/N:8/C:32 | 3303 | 2814 | 1.17377399 |
| tensor_add/N:8/C:64 | 4152 | 3156 | 1.31558935 |
| tensor_add/N:8/C:128 | 5761 | 3767 | 1.52933369 |
| tensor_add/N:8/C:256 | 9234 | 5610 | 1.6459893 |
| tensor_add/N:8/C:512 | 15685 | 8952 | 1.75212243 |
| tensor_add/N:16/C:8 | 2905 | 2630 | 1.10456274 |
| tensor_add/N:16/C:16 | 3428 | 2713 | 1.26354589 |
| tensor_add/N:16/C:32 | 3958 | 3098 | 1.27759845 |
| tensor_add/N:16/C:64 | 5548 | 3864 | 1.43581781 |
| tensor_add/N:16/C:128 | 9343 | 5540 | 1.68646209 |
| tensor_add/N:16/C:256 | 16239 | 9274 | 1.75102437 |
| tensor_add/N:16/C:512 | 29663 | 14772 | 2.00805578 |
| tensor_add/N:32/C:8 | 3218 | 2721 | 1.18265344 |
| tensor_add/N:32/C:16 | 4006 | 3178 | 1.26054122 |
| tensor_add/N:32/C:32 | 5576 | 3921 | 1.4220862 |
| tensor_add/N:32/C:64 | 9263 | 5614 | 1.64998219 |
| tensor_add/N:32/C:128 | 16464 | 8593 | 1.91597812 |
| tensor_add/N:32/C:256 | 29613 | 14656 | 2.02053766 |
| tensor_add/N:32/C:512 | 82981 | 56478 | 1.46926237 |
| tensor_add/N:64/C:8 | 4078 | 3188 | 1.27917189 |
| tensor_add/N:64/C:16 | 5780 | 3935 | 1.46886912 |
| tensor_add/N:64/C:32 | 9367 | 5382 | 1.74043107 |
| tensor_add/N:64/C:64 | 16069 | 8658 | 1.85597136 |
| tensor_add/N:64/C:128 | 28780 | 15342 | 1.87589623 |
| tensor_add/N:64/C:256 | 80601 | 57625 | 1.39871584 |
| tensor_add/N:64/C:512 | 172897 | 100800 | 1.71524802 |
| tensor_add/N:128/C:8 | 5771 | 3800 | 1.51868421 |
| tensor_add/N:128/C:16 | 9408 | 5496 | 1.71179039 |
| tensor_add/N:128/C:32 | 15836 | 8735 | 1.81293646 |
| tensor_add/N:128/C:64 | 29315 | 15229 | 1.92494583 |
| tensor_add/N:128/C:128 | 88595 | 51456 | 1.72176228 |
| tensor_add/N:128/C:256 | 160211 | 101824 | 1.57341098 |
| tensor_add/N:128/C:512 | 218426 | 165128 | 1.32276779 |
| tensor_add/N:256/C:8 | 9194 | 5538 | 1.66016612 |
| tensor_add/N:256/C:16 | 15659 | 9036 | 1.73295706 |
| tensor_add/N:256/C:32 | 29465 | 15246 | 1.93263807 |
| tensor_add/N:256/C:64 | 85880 | 50658 | 1.69528998 |
| tensor_add/N:256/C:128 | 160866 | 102213 | 1.57383112 |
| tensor_add/N:256/C:256 | 236656 | 161019 | 1.4697396 |
| tensor_add/N:256/C:512 | 268035 | 236729 | 1.13224404 |
| tensor_add/N:512/C:8 | 16304 | 9000 | 1.81155556 |
| tensor_add/N:512/C:16 | 29797 | 15173 | 1.96381731 |
| tensor_add/N:512/C:32 | 82945 | 53583 | 1.5479723 |
| tensor_add/N:512/C:64 | 163287 | 107626 | 1.51717057 |
| tensor_add/N:512/C:128 | 207900 | 145849 | 1.42544687 |
| tensor_add/N:512/C:256 | 284782 | 233692 | 1.21862109 |
| tensor_add/N:512/C:512 | 798740 | 529302 | 1.50904399 |

kimishpatel added a commit that referenced this pull request Jul 10, 2020
ghstack-source-id: 32994b9
Pull Request resolved: #41267
dr-ci bot commented Jul 11, 2020

💊 CI failures summary and remediations

As of commit 568d9d6 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



1 failure confirmed as flaky and can be ignored:

  • pytorch_linux_xenial_py3_6_gcc5_4_test

🚧 2 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.




template<>
inline float32x4_t fill_mask<0, true>(float32x4_t mask){
static uint32x4_t int_mask = {0xFFFFFFFF, 0x0, 0x0, 0x0};
Contributor:
Can this be a constexpr instead of static. If not it could be const whether you decide to have it as static or not.

Contributor Author:
I think I tried that, but I don't think it worked for the intrinsic data type.
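For reference, a minimal sketch of the reviewer's fallback suggestion (illustrative only; whether constexpr compiles depends on how the toolchain defines uint32x4_t, and the author reports that it did not):

```cpp
#include <arm_neon.h>

// Hedged sketch, not the PR's code: the mask is fixed, so it can at least be
// const (static or not, as preferred), even if constexpr does not compile for
// the NEON vector type on this toolchain.
static const uint32x4_t kLane0Mask = {0xFFFFFFFFu, 0x0u, 0x0u, 0x0u};
```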

template<>
inline float32x4_t fill_mask<0, true>(float32x4_t mask){
static uint32x4_t int_mask = {0xFFFFFFFF, 0x0, 0x0, 0x0};
float32x4_t tmp_mask = vreinterpretq_f32_u32(int_mask);
Contributor:
If you managed to make the above constexpr without running into compilation issues, you might be able to make this constexpr to push the computation to compile-time since in theory the mask is completely known at compile-time.

The same two comments apply to similar cases below as well.

Contributor Author:
Yes, that's what I wanted to do, but it did not work. I am also surprised that I still have vreinterpretq_f32_u32 left here. I need to fix this, or else the compiler segfaults. Although in this case it apparently did not.

"vorr %[in_mask], %[in_mask], %[in_tmp]\n\t"
: [in_mask] "+w" (mask)
: [in_tmp] "w" (tmp_mask)
: );
Contributor:
One thing to keep in mind is to add "cc" to the clobber list if the instruction modifies the flag register. You have to check the reference manual for all of the instructions you are using as inline assembly in this PR to see if that's the case.

Contributor Author:
Good point.
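To illustrate the point with a generic aarch32 example (not code from this PR): flag-setting scalar instructions such as adds update the APSR, so they need "cc" in the clobber list, whereas NEON data-processing instructions such as vorr only write their destination registers.

```cpp
// Hedged, generic example (names illustrative): "adds" updates the condition
// flags, so "cc" is declared as clobbered and the compiler does not assume
// the flags survive the asm statement.
inline int add_flag_setting(int a, int b) {
  int out;
  __asm__ ("adds %[out], %[a], %[b]\n\t"
           : [out] "=r" (out)
           : [a] "r" (a), [b] "r" (b)
           : "cc");
  return out;
}
```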

return 8;
}
Vec256() {}
Vec256(float32x4x2_t v) : values(v) {}
Contributor:
I would personally mark all parameters that do not change as const. :) Making mistakes is easy, and I think it is good practice to program defensively.
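A minimal sketch of that suggestion on an illustrative wrapper type (not the PR's class):

```cpp
#include <arm_neon.h>

// Hedged sketch: by-value parameters that are never modified are marked const
// as a defensive-programming habit, as the reviewer suggests.
struct Vec8f {
  float32x4x2_t values;
  Vec8f() {}
  Vec8f(const float32x4x2_t v) : values(v) {}
};
```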

{
Vec256<float> vec;
static uint32x4_t mask_low = {0xFFFFFFFF, 0x0, 0x0, 0x0};
vec.values.val[0] = (float32x4_t)mask_low;
Contributor:
It seems to me that what this does is take UINT_MAX (i.e. 0xFFFFFFFF == 4294967295) and store into val[0] the IEEE-754 FP32 bit pattern closest to representing 4294967295.0f (i.e. 0x4F800000). Is that what you intended? If you wanted the 0xFFFFFFFF bit pattern you should either use vreinterpretq_f32_u32 or *(const float32x4_t*)(&mask_low).

Contributor Author:
Since the type is uint32x4_t, I am not sure how meaningful it is to cast to a different pointer type. On the other hand, the use of intrinsics was resulting in a segfault. I just looked the intrinsic up in arm_neon.h and it was doing just (float32x4_t), so I replaced it with that. Although I am not sure why the similar use of the intrinsic elsewhere did not cause a problem.

Contributor:
Interesting.
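To make the distinction in this thread concrete, a hedged sketch (function and variable names are not from the PR): vreinterpretq_f32_u32 preserves the bit pattern, vcvtq_f32_u32 converts values, and what a plain C-style cast between vector types does is defined by the compiler's vector extension (Clang's arm_neon.h implements vreinterpretq_f32_u32 as exactly such a cast, as discussed later in this review).

```cpp
#include <arm_neon.h>

// Hedged illustration of the two interpretations discussed above.
inline void mask_interpretations() {
  const uint32x4_t mask_low = {0xFFFFFFFFu, 0x0u, 0x0u, 0x0u};

  // Bit reinterpretation: lane 0 keeps the 0xFFFFFFFF pattern (a NaN as FP32).
  const float32x4_t as_bits = vreinterpretq_f32_u32(mask_low);

  // Value conversion: lane 0 becomes the float nearest to 4294967295,
  // i.e. 4294967296.0f (bit pattern 0x4F800000).
  const float32x4_t as_values = vcvtq_f32_u32(mask_low);

  (void)as_bits;
  (void)as_values;
}
```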

__asm__ __volatile__ (
"vmov.f32 q3, %[in_low]\n\t"
"vmov.f32 q4, %[in_high]\n\t"
"vst1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"
Contributor:
You can align tmp_values and use the @align property of the store for slightly faster execution.

:
: [in_ptr] "r" (tmp_values),
[in_low] "w" (values.val[0]), [in_high] "w" (values.val[1])
: "q3", "q4");
Contributor:
Again, add "memory" here and everywhere else you are doing a read or write memory access.

Contributor Author:
I think it makes the ordering requirements more stringent and prevents the compiler from doing any optimization, since it makes the asm look like a barrier. Hence I did not do it.

Contributor:
I am not sure it is optional. I was under the impression that it is required for correct behavior.

Contributor Author:
@AshkanAliabadi, this one seems to suggest that it prevents the compiler from reordering loads/stores around the asm: https://stackoverflow.com/questions/14449141/the-difference-between-asm-asm-volatile-and-clobbering-memory. Basically it makes the compiler assume that it does not know what memory gets written inside the asm, so it does not reorder loads/stores across it. But since the address registers are linked into the asm, I assume the compiler understands what gets written? I was trying not to put constraints on compiler optimization.
It also seems like modifying this to "rm" should be OK: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

Contributor Author:
@AshkanAliabadi, it doesn't seem that I can get "rm" to work, so I just added "memory" to the clobber list as you suggested.
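For reference, the resolution reached in this thread in a minimal, self-contained form (the function name and signature are illustrative; the asm mirrors the store shown above, with "memory" added):

```cpp
#include <arm_neon.h>

// Hedged sketch of the outcome of this thread: the asm writes through ptr,
// which no constraint describes, so "memory" is listed in the clobbers
// alongside the pinned q3/q4 registers.
inline void store_eight_floats(float* ptr, const float32x4x2_t v) {
  __asm__ __volatile__ (
      "vmov.f32 q3, %[in_low]\n\t"
      "vmov.f32 q4, %[in_high]\n\t"
      "vst1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"
      :
      : [in_ptr] "r" (ptr),
        [in_low] "w" (v.val[0]), [in_high] "w" (v.val[1])
      : "q3", "q4", "memory");
}
```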

return Vec256<float>(low, high);
}
else {
__at_align32__ float tmp_values[size()];
Contributor:
OK, it seems this is how PyTorch prefers to do it, probably a practice from before alignas was added to the language.

// this should be removed. TODO (kimishpatel)
const float operator[](int idx) const {
__at_align32__ float tmp[size()];
store(tmp);
Contributor:
We have a store() and a loadu(). You can consider adding aligned versions of these functions for internal use in this class, and then use them where you have control over the alignment of data like here. Just a suggestion.

Contributor Author:
I don't know for sure whether it makes a perf difference. It probably avoids the address alignment check, but my understanding is that if the address is aligned there is no penalty, and if it is not there is a one-cycle penalty. On x86, if you use an aligned load on an unaligned address you get an exception; I am not sure whether that is the case here, but either way I do not know whether this introduces a performance penalty.

Vec256<float> operator==(const Vec256<float>& other) const {
float32x4_t res0, res1;
__asm__ (
"vceq.f32 %[res0], %[in_0], %[other_in_0]\n\t"
Contributor:
Does this modify the flags register? I think not, but if it does you need to add "cc" to the clobber list.

Contributor Author:
I will check. Thanks for this catch.

kimishpatel added a commit that referenced this pull request Jul 20, 2020
ghstack-source-id: b415552
Pull Request resolved: #41267
template<>
inline float32x4_t fill_mask<0, true>(float32x4_t mask){
static uint32x4_t int_mask = {0xFFFFFFFF, 0x0, 0x0, 0x0};
float32x4_t tmp_mask = (float32x4_t)(int_mask);
Contributor:
I am still not sure this is the correct portable usage. The fact that the compiler crashes on a use of vreinterpretq_f32_s32 and accepts (float32x4_t) does not necessarily mean that one is logically correct and the other is not. There are places online that use vreinterpret_f32_u32, for instance here: https://skia.googlesource.com/skia/+/refs/heads/chrome/m54/src/opts/SkNx_neon.h#57, and logically speaking your usage looks like a reinterpret_cast, so I am puzzled as to what the underlying issue is.

It seems that Clang's arm_neon.h implementation of vreinterpretq_f32_s8 does indeed use a C-style cast, but GCC's does not. For instance, take a look at arm_neon.h here, or here, both of which appear to be GCC's implementation using an internal compiler intrinsic, so this approach is at least not portable, even if from a practical standpoint both iOS and recent versions of Android use Clang.

Do we have any tests covering this code path? Have those tests run and passed on Android and iPhone (considering that our CI doesn't run any tests on a phone)?

What do you think @dreiss?

Contributor Author:
> The fact that the compiler crashes on a use of vreinterpretq_f32_s32 and accepts (float32x4_t) does not necessarily mean that one is logically correct and the other is not.

I agree, but that is not the argument here. The reason for the crash is some interaction between the use of intrinsics and the optimization path in 32-bit compilation:
android/ndk#1248
https://bugs.llvm.org/show_bug.cgi?id=45824

In fact, I tried to keep only the inline assembly that was needed and use intrinsics everywhere else. However, this is not about which intrinsics are OK and which are not: depending on the optimization path the compiler takes, it crashes or it does not. So I cannot pick and choose where to use intrinsics and where not to.

> It seems that Clang's arm_neon.h implementation of vreinterpretq_f32_s8 does indeed use a C-style cast, but GCC's does not.

That is a fair point, and we can do internal runs and test this with GCC to see if that is sufficient.

But without that, we either put in compiler-specific ifdefs and enable this only for Clang, or we do not do a 32-bit ARM NEON backend at all.

Contributor:
Yeah, I think it's still possible to build PyTorch for ARM with gcc (on Raspberry Pi, for example), so we should keep that working. I think a compiler-specific ifdef with a note about why is fine.
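A minimal sketch of the compiler-specific guard being discussed (the function name is illustrative, and the final PR may shape this differently): use the C-style vector cast only under Clang, where arm_neon.h implements vreinterpretq the same way, and keep the intrinsic for GCC builds such as Raspberry Pi.

```cpp
#include <arm_neon.h>

// Hedged sketch, not the PR's final code.
inline float32x4_t reinterpret_u32_as_f32(const uint32x4_t int_mask) {
#if defined(__clang__)
  // Clang: the C-style vector cast sidesteps the intrinsic path implicated in
  // the 32-bit codegen crash referenced above.
  return (float32x4_t)(int_mask);
#else
  // GCC (e.g. Raspberry Pi builds): use the portable intrinsic.
  return vreinterpretq_f32_u32(int_mask);
#endif
}
```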

}
static Vec256<float> loadu(const void* ptr, int64_t count = size()) {
if (count == size()) {
volatile register float32x4_t low asm("q3");
Contributor:
Well, what this means is that we are only postponing this issue, since whoever ports PyTorch to C++17 in a year or two will have to deal with this. :) Anyhow, I am OK with that if you and David are.

volatile register float32x4_t low asm("q3");
volatile register float32x4_t high asm("q4");
__asm__ __volatile__ (
"vld1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"
Contributor:
This is microarchitecture dependent but generally there is a performance benefit to aligned loads and stores.

You can also specify an alignment for the pointer passed in Rn, using the optional : parameter, which often speeds up memory accesses.

Reference: https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-1-load-and-stores

You can see in the table on VLDn's reference page that your particular use (i.e. VLDn {Dd, D(d+1), D(d+2), D(d+3)}, Rn) can accept three different, increasingly strict alignments.
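A hedged sketch of what that could look like for the load shown above, assuming the pointer is known to be 16-byte aligned (the function name is illustrative). Note that GNU as spells ARM's "@128" alignment qualifier as ":128", because '@' starts a comment in its ARM syntax.

```cpp
#include <arm_neon.h>

// Hedged sketch, not the PR's exact code: the same pinned-register load as
// above, but with a 128-bit alignment hint on the address.
inline float32x4x2_t load_eight_floats_aligned(const float* ptr) {
  register float32x4_t low asm("q3");
  register float32x4_t high asm("q4");
  __asm__ __volatile__ (
      "vld1.32 {d6, d7, d8, d9}, [%[in_ptr]:128]\n\t"
      : [out_lo] "=w" (low), [out_hi] "=w" (high)
      : [in_ptr] "r" (ptr)
      : "memory");
  float32x4x2_t out;
  out.val[0] = low;
  out.val[1] = high;
  return out;
}
```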

@AshkanAliabadi (Contributor):
LGTM except for the two issues. It would be great to see what David thinks about them.

@kimishpatel (Contributor Author):
> Well, what this means is that we are only postponing this issue, since whoever ports PyTorch to C++17 in a year or two will have to deal with this. :) Anyhow, I am OK with that if you and David are.

@AshkanAliabadi, hopefully by then the compiler issue will be resolved and we can remove the inline assembly path completely. If you have a better suggestion that avoids the moves in the inline assembly, I would be happy to do it that way. I could not think of one.

kimishpatel added a commit that referenced this pull request Jul 22, 2020
ghstack-source-id: 0681cda
Pull Request resolved: #41267
kimishpatel added a commit that referenced this pull request Jul 22, 2020
ghstack-source-id: dee3748
Pull Request resolved: #41267
kimishpatel added a commit that referenced this pull request Jul 23, 2020
ghstack-source-id: 803507f
Pull Request resolved: #41267
@AshkanAliabadi (Contributor) left a comment:
Thank you Kimish.

@facebook-github-bot (Contributor):
This pull request has been merged in dede71d.

facebook-github-bot deleted the gh/kimishpatel/34/head branch July 28, 2020 14:18
volatile register float32x4_t low asm("q3");
volatile register float32x4_t high asm("q4");
__asm__ __volatile__ (
"vld1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"

Could you please try to use vld1.32 {%S0 - %V0} instead of naming registers d6 to d9?
Unfortunately this will work only with gcc, and not with clang.
