Support aarch32 neon backend for Vec256 #41267
Conversation
Summary: Due to an LLVM bug and some unsupported intrinsics, we could not directly use intrinsics to implement the aarch32 NEON backend for Vec256. Instead we resort to inline assembly. Test Plan: vec256_test run on an Android phone. Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Summary: Due to an LLVM bug and some unsupported intrinsics, we could not directly use intrinsics to implement the aarch32 NEON backend for Vec256. Instead we resort to inline assembly. Test Plan: vec256_test run on an Android phone. Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 32994b9 Pull Request resolved: #41267
💊 CI failures summary and remediations
As of commit 568d9d6 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
1 failure confirmed as flaky and can be ignored.
🚧 2 fixed upstream failures: these were probably caused by upstream breakages that were already fixed. Please rebase on the
template<>
inline float32x4_t fill_mask<0, true>(float32x4_t mask){
  static uint32x4_t int_mask = {0xFFFFFFFF, 0x0, 0x0, 0x0};
Can this be a constexpr instead of static? If not, it could be const, whether or not you decide to keep it static.
I think I tried that, but I don't think it worked for the intrinsic data type.
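For reference, a minimal sketch of the two variants under discussion; whether constexpr compiles here depends on the toolchain, since uint32x4_t is a compiler-internal vector type (the helper name is illustrative, only the mask line comes from the diff):

```cpp
#include <arm_neon.h>

// const compiles on both GCC and Clang; constexpr may be rejected because
// brace-initializing a NEON vector type need not be a constant expression
// on every toolchain.
inline float32x4_t first_lane_mask() {
  static const uint32x4_t int_mask = {0xFFFFFFFF, 0x0, 0x0, 0x0};
  return vreinterpretq_f32_u32(int_mask);
}
```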
template<>
inline float32x4_t fill_mask<0, true>(float32x4_t mask){
  static uint32x4_t int_mask = {0xFFFFFFFF, 0x0, 0x0, 0x0};
  float32x4_t tmp_mask = vreinterpretq_f32_u32(int_mask);
If you managed to make the above constexpr without running into compilation issues, you might be able to make this constexpr as well, pushing the computation to compile time, since in theory the mask is completely known at compile time.
The same two comments apply to the similar cases below.
Yes, that's what I wanted to do, but it did not work. I am also surprised that I still have vreinterpretq_f32_u32 left here. I need to fix this, or else the compiler seg faults. Although in this case it apparently did not.
| "vorr %[in_mask], %[in_mask], %[in_tmp]\n\t" | ||
| : [in_mask] "+w" (mask) | ||
| : [in_tmp] "w" (tmp_mask) | ||
| : ); |
One thing to keep in mind is to add "cc" to the clobber list if the instruction modifies the flags register. You will have to check the reference manual for every instruction used as inline assembly in this PR to see whether that is the case.
Good point.
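For reference, a sketch of the rule applied to the VORR above: VORR is a NEON data-processing instruction and does not update the APSR condition flags, so "cc" stays out of this clobber list, but a flag-setting instruction would need it (function name illustrative):

```cpp
#include <arm_neon.h>

inline float32x4_t or_mask(float32x4_t mask, const float32x4_t tmp_mask) {
  __asm__ (
      "vorr %[in_mask], %[in_mask], %[in_tmp]\n\t"
      : [in_mask] "+w" (mask)
      : [in_tmp] "w" (tmp_mask)
      : /* add "cc" here only if an instruction modifies the flags */);
  return mask;
}
```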
return 8;
}
Vec256() {}
Vec256(float32x4x2_t v) : values(v) {}
I would personally mark as const all parameters that do not change. :) Making mistakes is easy and I think it's good practice to program defensively.
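A one-line sketch of that style, applied to the constructor in the diff above:

```cpp
// const documents that the constructor never mutates its argument.
Vec256(const float32x4x2_t v) : values(v) {}
```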
{
Vec256<float> vec;
static uint32x4_t mask_low = {0xFFFFFFFF, 0x0, 0x0, 0x0};
vec.values.val[0] = (float32x4_t)mask_low;
It seems to me that what this does is take UINT_MAX (i.e. 0xFFFFFFFF == 4294967295) and store into val[0] the IEEE-754 FP32 bit pattern closest to representing 4294967295.0f (i.e. 0x4F800000). Is that what you intended? If you wanted the 0xFFFFFFFF bit pattern, you should use either vreinterpretq_f32_u32 or *(const float32x4_t*)(&mask_low).
Since the type is uint32x4_t, I am not sure how meaningful it is to cast to a different pointer type. On the other hand, the use of intrinsics was resulting in a seg fault. I just looked the intrinsic up in arm_neon.h and it was doing just (float32x4_t), so I replaced the intrinsic with the cast. Although I'm not sure why the similar use of the intrinsic elsewhere did not cause a problem.
Interesting.
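A sketch contrasting the two operations this thread distinguishes, assuming Clang's arm_neon.h, where the vreinterpret intrinsics lower to plain vector casts:

```cpp
#include <arm_neon.h>

static const uint32x4_t mask_low = {0xFFFFFFFF, 0x0, 0x0, 0x0};

// Bit-level reinterpretation: lane 0 keeps the 0xFFFFFFFF bit pattern
// (a NaN when viewed as a float). This is what the mask code wants.
float32x4_t as_bits = vreinterpretq_f32_u32(mask_low);

// Per-lane value conversion: lane 0 becomes 4294967296.0f, the float
// nearest to UINT_MAX (bit pattern 0x4F800000).
float32x4_t as_values = vcvtq_f32_u32(mask_low);
```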
__asm__ __volatile__ (
"vmov.f32 q3, %[in_low]\n\t"
"vmov.f32 q4, %[in_high]\n\t"
"vst1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"
You can align tmp_values and use the @align property of the store for slightly faster execution.
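A sketch of that suggestion, assuming tmp_values can be 16-byte aligned; the :128 qualifier on the address register is the assembly-level spelling of the alignment hint (the "memory" clobber anticipates the discussion below):

```cpp
alignas(16) float tmp_values[8];
__asm__ __volatile__ (
    "vmov.f32 q3, %[in_low]\n\t"
    "vmov.f32 q4, %[in_high]\n\t"
    // :128 promises a 16-byte-aligned pointer, which can make the store
    // slightly faster on some microarchitectures. An unaligned pointer
    // used with this qualifier faults at runtime.
    "vst1.32 {d6, d7, d8, d9}, [%[in_ptr]:128]\n\t"
    :
    : [in_ptr] "r" (tmp_values),
      [in_low] "w" (values.val[0]), [in_high] "w" (values.val[1])
    : "q3", "q4", "memory");
```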
:
: [in_ptr] "r" (tmp_values),
[in_low] "w" (values.val[0]), [in_high] "w" (values.val[1])
: "q3", "q4");
Again "memory" here and everywhere else you are doing a read or write memory access."
I think it makes the ordering requirements more stringent and prevents the compiler from doing any optimization, since I believe it makes the asm look like a barrier. Hence I did not do it.
I am not sure it is optional. I was under the impression that it is required for correct behavior.
@AshkanAliabadi, this one seems to suggest that it prevents the compiler from reordering loads/stores around the asm: https://stackoverflow.com/questions/14449141/the-difference-between-asm-asm-volatile-and-clobbering-memory. Basically it makes the compiler assume that it doesn't know what memory gets written inside the asm, so it won't reorder loads/stores. But since the address registers are linked into the asm, I assume the compiler understands what gets written? I was trying not to put constraints on compiler optimization.
It seems like modifying this to "rm" should be OK: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
@AshkanAliabadi, it doesn't seem that I can get "rm" to work, so I just added "memory" to the clobber list as you suggested.
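For reference, a sketch of how the resolution reads on the loadu counterpart shown further down, with the registers pinned as in the diff:

```cpp
volatile register float32x4_t low asm("q3");
volatile register float32x4_t high asm("q4");
__asm__ __volatile__ (
    "vld1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"
    :
    : [in_ptr] "r" (ptr)
    // "memory" tells the compiler the asm touches memory it cannot see,
    // so surrounding loads/stores are not reordered across it.
    : "memory");
```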
return Vec256<float>(low, high);
}
else {
__at_align32__ float tmp_values[size()];
OK, it seems that this is how PyTorch prefers to do it, probably a practice from before alignas was added to the language.
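For comparison, a sketch of the post-C++11 spelling, assuming 32-byte alignment is what the __at_align32__ macro provides:

```cpp
// Modern equivalent of the macro on a stack buffer (size() is constexpr):
alignas(32) float tmp_values[size()];
```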
// this should be removed. TODO (kimishpatel)
const float operator[](int idx) const {
__at_align32__ float tmp[size()];
store(tmp);
We have a store() and a loadu(). You could consider adding aligned versions of these functions for internal use in this class, and then use them where you control the alignment of the data, like here. Just a suggestion.
I don't know for sure whether it makes a perf difference. It probably avoids the address-alignment check, but my understanding is that if the address is aligned there is no perf penalty, while if it is not there is a one-cycle penalty. On x86, using the aligned version of a load with an unaligned address raises an exception; I'm not sure whether that's the case here, but either way I don't know whether this introduces a performance penalty.
Vec256<float> operator==(const Vec256<float>& other) const {
float32x4_t res0, res1;
__asm__ (
"vceq.f32 %[res0], %[in_0], %[other_in_0]\n\t"
Does this modify the flags register? I think not, but if it does you need to add "cc" to the clobber list.
I will check. Thanks for this catch.
template<>
inline float32x4_t fill_mask<0, true>(float32x4_t mask){
  static uint32x4_t int_mask = {0xFFFFFFFF, 0x0, 0x0, 0x0};
  float32x4_t tmp_mask = (float32x4_t)(int_mask);
I am still not sure this is the correct portable usage. The fact that the compiler crashes on a use of vreinterpretq_f32_s32 and accepts (float32x4_t) does not necessarily mean that one is logically correct and the other is not. There are places online that use vreinterpret_f32_u32, for instance here: https://skia.googlesource.com/skia/+/refs/heads/chrome/m54/src/opts/SkNx_neon.h#57, and logically speaking your usage looks like a reinterpret_cast, so I am puzzled what the underlying issue is.
It seems that Clang's arm_neon.h implementation of vreinterpretq_f32_s8 does indeed use a C-style cast, but GCC's does not. For instance, take a look at arm_neon.h here, or here, both of which appear to be GCC's implementation using an internal compiler intrinsic; so this approach is at least not portable, even if from a practical standpoint both iOS and recent versions of Android use Clang.
Do we have any tests covering this code path? Have those tests run and passed on Android and iPhone (considering that our CI doesn't run any tests on a phone)?
What do you think @dreiss?
> The fact that the compiler crashes on a use of vreinterpretq_f32_s32 and accepts (float32x4_t) does not necessarily mean that one is logically correct and the other is not.

I agree, but that is not the argument here. The reason for the crash is some interaction between the use of intrinsics and the optimization path taken in 32-bit compilation:
android/ndk#1248
https://bugs.llvm.org/show_bug.cgi?id=45824
In fact, I tried to keep only the inline assembly that was needed and use intrinsics everywhere else. However, this is not about which intrinsics are OK and which are not: depending on the optimization path the compiler takes, it crashes or it does not. So I cannot pick and choose where to use intrinsics and where not to.

> It seems that Clang's arm_neon.h implementation of vreinterpretq_f32_s8 does indeed use a C-style cast, but GCC's does not. [...]

That is a fair point, and we can do internal runs to test this with GCC, to see if that is sufficient.
But without that, we either put in compiler-specific ifdefs and enable this only for Clang, or we don't do the 32-bit ARM NEON backend at all.
Yeah, I think it's still possible to build PyTorch for ARM with gcc (on Raspberry Pi, for example), so we should keep that working. I think a compiler-specific ifdef with a note about why is fine.
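A minimal sketch of such a guard (the exact macro test and placement are illustrative):

```cpp
// Clang's arm_neon.h lowers vreinterpretq_f32_u32 to a plain vector cast,
// and the intrinsic itself can trigger the LLVM crash referenced above
// (llvm.org/PR45824), so use the cast form there; GCC routes the intrinsic
// through an internal builtin, so keep the intrinsic for GCC.
#if defined(__clang__)
float32x4_t tmp_mask = (float32x4_t)int_mask;
#else
float32x4_t tmp_mask = vreinterpretq_f32_u32(int_mask);
#endif
```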
}
static Vec256<float> loadu(const void* ptr, int64_t count = size()) {
if (count == size()) {
volatile register float32x4_t low asm("q3");
Well, what this means is that we are only postponing this issue, since whoever ports PyTorch to C++17 in a year or two will have to deal with it. :) Anyhow, I am OK with that if you and David are.
volatile register float32x4_t low asm("q3");
volatile register float32x4_t high asm("q4");
__asm__ __volatile__ (
"vld1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"
This is microarchitecture dependent, but generally there is a performance benefit to aligned loads and stores. From the reference manual:

> You can also specify an alignment for the pointer passed in Rn, using the optional : parameter, which often speeds up memory accesses.

You can see in the table on VLDn's reference page that your particular use (i.e. VLDn {Dd, D(d+1), D(d+2), D(d+3)}, Rn) accepts three different, increasingly strict alignments.
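A sketch of the qualified load; per VLD1's reference table, the four-D-register form accepts 64-, 128-, or 256-bit alignment hints, the strictest requiring a 32-byte-aligned pointer:

```cpp
__asm__ __volatile__ (
    // :256 asserts 32-byte alignment of the source pointer; use :64 or
    // :128 for weaker guarantees. An unaligned pointer here would fault.
    "vld1.32 {d6, d7, d8, d9}, [%[in_ptr]:256]\n\t"
    :
    : [in_ptr] "r" (ptr)
    : "memory");
```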
LGTM except for the two issues. It would be great to see what David thinks about them.
@AshkanAliabadi, hopefully by then the compiler issue will be resolved and we can remove the inline-assembly path completely. If you have a better suggestion that avoids the moves in the inline assembly, I would be happy to do it that way. I could not think of any.
Summary: Due to an LLVM bug and some unsupported intrinsics, we could not directly use intrinsics to implement the aarch32 NEON backend for Vec256. Instead we resort to inline assembly. Test Plan: vec256_test run on an Android phone. Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D22482196](https://our.internmc.facebook.com/intern/diff/D22482196) Benchmark results:

| Benchmark | Time (ns) Baseline | Time (ns) 32-bit ARM NEON backend Vec256 | Speedup |
| -- | -- | -- | -- |
| tensor_add/N:8/C:8 | 2655 | 2567 | 1.03428126 |
| tensor_add/N:8/C:16 | 2805 | 2630 | 1.06653992 |
| tensor_add/N:8/C:32 | 3303 | 2814 | 1.17377399 |
| tensor_add/N:8/C:64 | 4152 | 3156 | 1.31558935 |
| tensor_add/N:8/C:128 | 5761 | 3767 | 1.52933369 |
| tensor_add/N:8/C:256 | 9234 | 5610 | 1.6459893 |
| tensor_add/N:8/C:512 | 15685 | 8952 | 1.75212243 |
| tensor_add/N:16/C:8 | 2905 | 2630 | 1.10456274 |
| tensor_add/N:16/C:16 | 3428 | 2713 | 1.26354589 |
| tensor_add/N:16/C:32 | 3958 | 3098 | 1.27759845 |
| tensor_add/N:16/C:64 | 5548 | 3864 | 1.43581781 |
| tensor_add/N:16/C:128 | 9343 | 5540 | 1.68646209 |
| tensor_add/N:16/C:256 | 16239 | 9274 | 1.75102437 |
| tensor_add/N:16/C:512 | 29663 | 14772 | 2.00805578 |
| tensor_add/N:32/C:8 | 3218 | 2721 | 1.18265344 |
| tensor_add/N:32/C:16 | 4006 | 3178 | 1.26054122 |
| tensor_add/N:32/C:32 | 5576 | 3921 | 1.4220862 |
| tensor_add/N:32/C:64 | 9263 | 5614 | 1.64998219 |
| tensor_add/N:32/C:128 | 16464 | 8593 | 1.91597812 |
| tensor_add/N:32/C:256 | 29613 | 14656 | 2.02053766 |
| tensor_add/N:32/C:512 | 82981 | 56478 | 1.46926237 |
| tensor_add/N:64/C:8 | 4078 | 3188 | 1.27917189 |
| tensor_add/N:64/C:16 | 5780 | 3935 | 1.46886912 |
| tensor_add/N:64/C:32 | 9367 | 5382 | 1.74043107 |
| tensor_add/N:64/C:64 | 16069 | 8658 | 1.85597136 |
| tensor_add/N:64/C:128 | 28780 | 15342 | 1.87589623 |
| tensor_add/N:64/C:256 | 80601 | 57625 | 1.39871584 |
| tensor_add/N:64/C:512 | 172897 | 100800 | 1.71524802 |
| tensor_add/N:128/C:8 | 5771 | 3800 | 1.51868421 |
| tensor_add/N:128/C:16 | 9408 | 5496 | 1.71179039 |
| tensor_add/N:128/C:32 | 15836 | 8735 | 1.81293646 |
| tensor_add/N:128/C:64 | 29315 | 15229 | 1.92494583 |
| tensor_add/N:128/C:128 | 88595 | 51456 | 1.72176228 |
| tensor_add/N:128/C:256 | 160211 | 101824 | 1.57341098 |
| tensor_add/N:128/C:512 | 218426 | 165128 | 1.32276779 |
| tensor_add/N:256/C:8 | 9194 | 5538 | 1.66016612 |
| tensor_add/N:256/C:16 | 15659 | 9036 | 1.73295706 |
| tensor_add/N:256/C:32 | 29465 | 15246 | 1.93263807 |
| tensor_add/N:256/C:64 | 85880 | 50658 | 1.69528998 |
| tensor_add/N:256/C:128 | 160866 | 102213 | 1.57383112 |
| tensor_add/N:256/C:256 | 236656 | 161019 | 1.4697396 |
| tensor_add/N:256/C:512 | 268035 | 236729 | 1.13224404 |
| tensor_add/N:512/C:8 | 16304 | 9000 | 1.81155556 |
| tensor_add/N:512/C:16 | 29797 | 15173 | 1.96381731 |
| tensor_add/N:512/C:32 | 82945 | 53583 | 1.5479723 |
| tensor_add/N:512/C:64 | 163287 | 107626 | 1.51717057 |
| tensor_add/N:512/C:128 | 207900 | 145849 | 1.42544687 |
| tensor_add/N:512/C:256 | 284782 | 233692 | 1.21862109 |
| tensor_add/N:512/C:512 | 798740 | 529302 | 1.50904399 |

[ghstack-poisoned]
AshkanAliabadi left a comment
Thank you Kimish.
This pull request has been merged in dede71d.
volatile register float32x4_t low asm("q3");
volatile register float32x4_t high asm("q4");
__asm__ __volatile__ (
"vld1.32 {d6, d7, d8, d9}, [%[in_ptr]]\n\t"
Could you please try to use vld1.32 {%S0 - %V0} instead of naming registers d6 to d9?
Unfortunately this will work only with GCC, and not with Clang.
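A sketch of the suggested form, per the comment above (GCC-only; the %S0 through %V0 operand modifiers print the consecutive D registers backing operand 0, so the compiler chooses the registers instead of the hard-coded d6-d9 / q3-q4 pinning; operand types illustrative):

```cpp
float32x4x2_t v;
__asm__ __volatile__ (
    "vld1.32 {%S0 - %V0}, [%1]\n\t"  // Clang rejects these modifiers
    : "=w" (v)
    : "r" (ptr)
    : "memory");
```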
Stack from ghstack:
Summary:
Due to an LLVM bug and some unsupported intrinsics, we could not directly
use intrinsics to implement the aarch32 NEON backend for Vec256.
Instead we resort to inline assembly.
Test Plan:
vec256_test run on an Android phone.
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D22482196
Benchmark results: